1. Introduction

Maverik is currently facing a significant challenge in preparing comprehensive ROI documents for their new store openings. This challenge stems from the difficulty of accurately predicting first-year sales, which in turn hinders the company's strategic financial planning and decision-making processes, ultimately affecting their sustainable growth. To address this issue, the primary analytical goal is to develop a predictive model capable of providing precise daily forecasts for merchandise sales, food sales, unleaded fuel sales, and diesel fuel sales for new Maverik stores during their first year of operation.

The Modeling is dedicated to achieving this objective. Its purpose is to build a model that can forecast merchandise sales, food sales, unleaded fuel sales, and diesel fuel sales for the next 365 days using historical data as a basis. The notebook's focus is on evaluating various time series forecasting models to determine the most suitable option that aligns with Maverik's specific needs. Since there are four target variables involved, this constitutes a multivariate time series analysis. The two models tested are Prophet and XGBoost.

2. Exploratory Data Analysis

In this section, we provide an overview of the Exploratory Data Analysis carried out as an integral part of the project. Following that, we delve into the Modeling phase, where we assess different models to identify the most suitable one for our future predictions.

This section also encompasses strategies for data preparation, which involves managing variable transformations, conducting feature engineering, and addressing missing values in the dataset.

2.1. Importing Packages

In [74]:
#pip install xgboost
In [75]:
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
from pandas.plotting import table
import plotly.express as px
import matplotlib.ticker as mtick
In [76]:
import plotly.io as pio
pio.renderers.default = 'notebook_connected'
In [77]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
In [78]:
from prophet import Prophet

2.2. Reading Data

In [79]:
q_data= pd.read_csv("qualitative_data_msba.csv")
t_data=pd.read_csv("time_series_data_msba.csv")

3. Summarizing & Understading the Data

In [80]:
q_data.head()
Out[80]:
Unnamed: 0 open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
0 1 2021 5046 2 2 38 Yes Yes Yes No ... 6 Stack HF/RV No Yes 2 2 6 2 21560
1 2 2021 5046 2 2 39 No Yes Yes Yes ... 4 Combo HF/RV No Yes 5 5 10 4 21980
2 3 2021 5046 2 2 35 Yes Yes Yes Yes ... 5 In-Line None No Yes 3 2 4 1 22015
3 4 2021 5046 2 2 36 No Yes Yes Yes ... 4 Combo HF/RV No Yes 3 3 6 2 22085
4 5 2021 5046 2 2 25 Yes Yes Yes No ... 0 NaN NaN No Yes 0 0 0 0 22120

5 rows × 55 columns

In [81]:
t_data.head()
Out[81]:
Unnamed: 0 capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
0 1 2022-06-14 2022-06-17 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535
1 2 2022-06-14 2022-06-22 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535
2 3 2022-06-14 2022-06-23 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535
3 4 2022-06-14 2022-06-26 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535
4 5 2022-06-14 2022-06-27 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535

3.1 Removing 1st column from both the dataframes named 'Unnamed: 0'

In [82]:
q_data= q_data.drop("Unnamed: 0", axis=1)
t_data= t_data.drop("Unnamed: 0", axis=1)
In [83]:
q_data.head()
Out[83]:
open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza cinnabon ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
0 2021 5046 2 2 38 Yes Yes Yes No No ... 6 Stack HF/RV No Yes 2 2 6 2 21560
1 2021 5046 2 2 39 No Yes Yes Yes No ... 4 Combo HF/RV No Yes 5 5 10 4 21980
2 2021 5046 2 2 35 Yes Yes Yes Yes No ... 5 In-Line None No Yes 3 2 4 1 22015
3 2021 5046 2 2 36 No Yes Yes Yes No ... 4 Combo HF/RV No Yes 3 3 6 2 22085
4 2021 5046 2 2 25 Yes Yes Yes No No ... 0 NaN NaN No Yes 0 0 0 0 22120

5 rows × 54 columns

In [84]:
t_data.head()
Out[84]:
capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
0 2022-06-14 2022-06-17 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535
1 2022-06-14 2022-06-22 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535
2 2022-06-14 2022-06-23 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535
3 2022-06-14 2022-06-26 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535
4 2022-06-14 2022-06-27 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535

3.2 Checking the shape of both the dataframes

In [85]:
q_data.shape
Out[85]:
(37, 54)
In [86]:
t_data.shape
Out[86]:
(13908, 11)

3.3 Obtaining description of the data in the datafame

In [87]:
q_data[['square_feet','x1_mile_pop','x1_mile_emp','x1_mile_income','x1_2_mile_pop','x1_2_mile_emp']].describe()
Out[87]:
square_feet x1_mile_pop x1_mile_emp x1_mile_income x1_2_mile_pop x1_2_mile_emp
count 37.00000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 4970.27027 6703.567568 4757.648649 53300.378378 1833.108108 1514.135135
std 575.93121 5694.011350 4697.168291 24333.027254 1915.140476 2489.423094
min 2933.00000 0.000000 56.000000 0.000000 0.000000 31.000000
25% 5046.00000 1984.000000 1771.000000 39538.000000 262.000000 386.000000
50% 5046.00000 5574.000000 3895.000000 46356.000000 1003.000000 1037.000000
75% 5046.00000 11269.000000 6002.000000 73519.000000 2686.000000 1616.000000
max 6134.00000 18692.000000 26077.000000 110957.000000 5923.000000 15403.000000
In [88]:
t_data[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded']].describe()
Out[88]:
daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded
count 13908.000000 13908.000000 13908.000000 13908.000000
mean 2846.537988 759.922326 1702.585227 2382.091588
std 981.299870 341.578220 2161.208192 1025.518658
min 0.000000 0.000000 0.000000 240.180500
25% 2181.156250 521.087875 383.062750 1654.149000
50% 2693.976250 697.434500 1018.920000 2256.677500
75% 3325.306250 924.282625 2283.297625 2928.254000
max 7172.466000 2531.662000 20853.952000 8077.233500

3.4 Identified Zero Variance Variables

The following columns exhibit consistent values throughout and do not demonstrate significant correlation with the target variables. Therefore, they can be considered as zero variance variables and may be removed from the dataset:

1. front_door_count
2. godfather_s_pizza
3. diesel
4. car_wash
5. ev_charging
6. non_24_hour
7. self_check_out
In [89]:
q_data.drop(['front_door_count','godfather_s_pizza','car_wash','ev_charging','non_24_hour','self_check_out','diesel'], axis=1,inplace=True)

3.5 Checking the shape of q_data after columns drop

In [90]:
q_data.shape
Out[90]:
(37, 47)

3.6 Finding NA and Categorical Values in Qualitative Dataframe (q_data)

In [91]:
na_values = q_data.isna().sum()
na_values[na_values > 0]
Out[91]:
rv_lanes_layout                14
rv_lanes_stack_type            14
hi_flow_lanes_layout           15
hi_flow_lanes_stack_type       15
hi_flow_rv_lanes_layout        14
hi_flow_rv_lanes_stack_type    14
dtype: int64
As per the above result, 6 columns of qualitative dataframe has N/A value. It means that Maverik does not provide these features in few of it stores.
In [92]:
cat_col= q_data.select_dtypes(include="object").columns.tolist()
cat_col
Out[92]:
['lottery',
 'freal',
 'bonfire_grill',
 'pizza',
 'cinnabon',
 'ethanol_free',
 'hi_flow_lanes',
 'rv_lanes',
 'hi_flow_rv_lanes',
 'def',
 'cat_scales',
 'rv_dumps',
 'propane',
 'traditional_forecourt_layout',
 'traditional_forecourt_stack_type',
 'rv_lanes_layout',
 'rv_lanes_stack_type',
 'hi_flow_lanes_layout',
 'hi_flow_lanes_stack_type',
 'hi_flow_rv_lanes_layout',
 'hi_flow_rv_lanes_stack_type']

3.6.1 Replacing NA values with None

In [93]:
columns_with_na = q_data.columns[q_data.isna().any()].tolist()

for column in columns_with_na:
    mode_value = q_data[column].mode()[0]
    q_data[column].fillna("None", inplace=True)

3.6.2 Factorizing the Categorical columns

In [94]:
for column in cat_col:
    q_data[column] = pd.factorize(q_data[column])[0]
In [95]:
import pandas as pd

# Define bins and labels
bins = [20, 30, 40, 50]
labels = [1, 2, 3]

# Bin the values in the 'parking_spaces' column
q_data['parking_spaces'] = pd.cut(q_data['parking_spaces'], bins=bins, labels=labels, include_lowest=True).astype(int)
In [96]:
q_data.head()
Out[96]:
open_year square_feet years_since_last_project parking_spaces lottery freal bonfire_grill pizza cinnabon ethanol_free ... hi_flow_lanes_stack_type hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
0 2021 5046 2 2 0 0 0 0 0 0 ... 0 4 6 0 0 2 2 6 2 21560
1 2021 5046 2 2 1 0 0 1 0 1 ... 0 9 4 1 0 5 5 10 4 21980
2 2021 5046 2 2 0 0 0 1 0 0 ... 1 0 5 2 1 3 2 4 1 22015
3 2021 5046 2 2 1 0 0 1 0 0 ... 0 5 4 1 0 3 3 6 2 22085
4 2021 5046 2 1 0 0 0 0 0 1 ... 1 0 0 3 1 0 0 0 0 22120

5 rows × 47 columns

3.7 Finding NA and Categorical Values in Time Series Dataframe (t_data)

In [97]:
na_values = t_data.isna().sum()
na_values[na_values > 0]
Out[97]:
Series([], dtype: int64)

There are no columns in time series data that has N/A value

3.8 Extracting Year, Month and Date from Calendar_day_date

In [98]:
t_data['calendar.calendar_day_date'] = pd.to_datetime(t_data['calendar.calendar_day_date'])

t_data['calendar_year'] = t_data['calendar.calendar_day_date'].dt.year
t_data['calendar_month'] = t_data['calendar.calendar_day_date'].dt.month
t_data['calendar_day'] = t_data['calendar.calendar_day_date'].dt.day
In [99]:
t_data.head()
Out[99]:
capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba calendar_year calendar_month calendar_day
0 2022-06-14 2022-06-17 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535 2022 6 17
1 2022-06-14 2022-06-22 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535 2022 6 22
2 2022-06-14 2022-06-23 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535 2022 6 23
3 2022-06-14 2022-06-26 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535 2022 6 26
4 2022-06-14 2022-06-27 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535 2022 6 27

4. Analysis On Qualitative and Time Series Data

4.1 Analysing the 4 target variables(In-Store, Food, Diesel, Unleaded)

4.1.1 Plotting diesel and unleaded sales on Days of the week for the year 2021,2022 and 2023

In [100]:
t_data['calendar.calendar_day_date'] = pd.to_datetime(t_data['calendar.calendar_day_date'])

t_data.set_index('calendar.calendar_day_date', inplace=True)
Use the slider to drill through year,month,date and time
In [101]:
average_diesel_sales = t_data.groupby(t_data.index)['diesel'].mean()
average_unleaded_sales = t_data.groupby(t_data.index)['unleaded'].mean()


average_sales_df = pd.DataFrame({'Diesel': average_diesel_sales, 'Unleaded': average_unleaded_sales})


fig = px.line(average_sales_df, x=average_sales_df.index, y=average_sales_df.columns,
              title='Average Diesel and Unleaded Sales Over Time')
fig.update_xaxes(title='Date', rangeslider_visible=True)
fig.update_yaxes(title='Average Sales')
fig.show()

4.1.2 Plotting Inside and Food sales on Days of the week for the year 2021,2022 and 2023

In [102]:
average_inside_sales = t_data.groupby(t_data.index)['daily_yoy_ndt.total_inside_sales'].mean()
average_food_sales = t_data.groupby(t_data.index)['daily_yoy_ndt.total_food_service'].mean()


average_sales_df = pd.DataFrame({'daily_yoy_ndt.total_inside_sales': average_inside_sales, 'daily_yoy_ndt.total_food_service': average_food_sales})


fig = px.line(average_sales_df, x=average_sales_df.index, y=average_sales_df.columns,
              title='Average Inside and Food Sales Over Time')
fig.update_xaxes(title='Date', rangeslider_visible=True)
fig.update_yaxes(title='Average Sales')
fig.show()

4.1.3 Sales of all target variables based on Calender Week

In [103]:
t_data['calendar.fiscal_week_id_for_year'].min(), t_data['calendar.fiscal_week_id_for_year'].max()
Out[103]:
(1, 52)
In [104]:
df_time_aggregate = t_data.groupby('calendar.fiscal_week_id_for_year')['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel', 'unleaded'].sum().reset_index()
Lets check if we have more than one sales data for a date
In [105]:
temp = t_data.groupby('calendar.fiscal_week_id_for_year')['daily_yoy_ndt.total_inside_sales'].size()
temp[temp > 1].sort_values(ascending=False)
Out[105]:
calendar.fiscal_week_id_for_year
20    273
38    270
13    270
33    270
28    270
24    270
21    270
49    270
45    270
19    269
11    268
46    268
39    268
32    268
31    268
30    268
29    268
4     268
2     268
26    268
25    268
5     268
42    268
50    268
18    268
15    268
9     268
41    266
48    266
37    266
43    266
44    266
51    266
40    266
47    266
1     266
27    266
36    266
35    266
34    266
23    266
22    266
17    266
16    266
14    266
12    266
10    266
8     266
7     266
6     266
3     266
52    266
Name: daily_yoy_ndt.total_inside_sales, dtype: int64
Indexing with Time Series Data
In [106]:
df_time_aggregate = df_time_aggregate.set_index('calendar.fiscal_week_id_for_year')
df_time_aggregate.index
Out[106]:
Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
            52],
           dtype='int64', name='calendar.fiscal_week_id_for_year')
In [107]:
plt.figure(figsize=(40, 50));
df_time_aggregate.plot();
<Figure size 2880x3600 with 0 Axes>

4.1.4 Identifying Sale of Diesel and Unleaded during Weekend and Weekdays of the years 2021, 2022, 2023

In [108]:
for year in [2021, 2022, 2023]:
    sales_year = t_data[t_data['calendar_year'] == year]
    pivot_table = sales_year.pivot_table(index='calendar_information.type_of_day',
                                         values=['diesel', 'unleaded'],
                                         aggfunc='sum')
    ax = pivot_table.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Diesel vs. Unleaded Sales by Type of Day - {year}')
    plt.xlabel('Type of Day')
    plt.ylabel('Total Sales')
    ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, p: format(int(x), ',')))
    ax.set_xticklabels(pivot_table.index, rotation=0)

    plt.legend(['Diesel', 'Unleaded'])
    plt.show()

4.1.5 Identifying Sale of Merchandise and Food during Weekend and Weekdays for the years 2021,2022,2023

In [109]:
for year in [2021, 2022, 2023]:
    sales_year = t_data[t_data['calendar_year'] == year]
    pivot_table = sales_year.pivot_table(index='calendar_information.type_of_day',
                                         values=['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service'],
                                         aggfunc='sum')
    ax = pivot_table.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Merchandise vs. Food Sales by Type of Day - {year}')
    plt.xlabel('Type of Day')
    plt.ylabel('Total Sales')
    ax.yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, p: format(int(x), ',')))

4.1.6 Plotting diesel and unleaded sales on Holidays for the year 2021,2022 and 2023

In [110]:
#2021
t_data_2021 = t_data[t_data['calendar_year'] == 2021]
sales_by_holiday_2021 = t_data_2021.groupby('calendar_information.holiday')[['diesel', 'unleaded']].sum()
sales_by_holiday_2021 = sales_by_holiday_2021[sales_by_holiday_2021.index != 'NONE']
holidays_2021 = sales_by_holiday_2021.index.tolist()
diesel_sales_2021 = sales_by_holiday_2021['diesel'].tolist()
unleaded_sales_2021 = sales_by_holiday_2021['unleaded'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2021))
r2 = [x + bar_width for x in r1]
plt.bar(r1, diesel_sales_2021, color='blue', width=bar_width, edgecolor='grey', label='Diesel')
plt.bar(r2, unleaded_sales_2021, color='orange', width=bar_width, edgecolor='grey', label='Unleaded')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total Diesel and Unleaded Sales on Each Holiday in 2021')
plt.xticks([r + bar_width/2 for r in r1], holidays_2021, rotation=90)
plt.legend()
plt.show()

#2022
t_data_2022 = t_data[t_data['calendar_year'] == 2022]
sales_by_holiday_2022 = t_data_2022.groupby('calendar_information.holiday')[['diesel', 'unleaded']].sum()
sales_by_holiday_2022 = sales_by_holiday_2022[sales_by_holiday_2022.index != 'NONE']
holidays_2022 = sales_by_holiday_2022.index.tolist()
diesel_sales_2022 = sales_by_holiday_2022['diesel'].tolist()
unleaded_sales_2022 = sales_by_holiday_2022['unleaded'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2022))
r2 = [x + bar_width for x in r1]
plt.bar(r1, diesel_sales_2022, color='blue', width=bar_width, edgecolor='grey', label='Diesel')
plt.bar(r2, unleaded_sales_2022, color='orange', width=bar_width, edgecolor='grey', label='Unleaded')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total Diesel and Unleaded Sales on Each Holiday in 2022')
plt.xticks([r + bar_width/2 for r in r1], holidays_2022, rotation=90)
plt.legend()
plt.show()

#2023
t_data_2023 = t_data[t_data['calendar_year'] == 2023]
sales_by_holiday_2023 = t_data_2023.groupby('calendar_information.holiday')[['diesel', 'unleaded']].sum()
sales_by_holiday_2023 = sales_by_holiday_2023[sales_by_holiday_2023.index != 'NONE']
holidays_2023 = sales_by_holiday_2023.index.tolist()
diesel_sales_2023 = sales_by_holiday_2023['diesel'].tolist()
unleaded_sales_2023 = sales_by_holiday_2023['unleaded'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2023))
r2 = [x + bar_width for x in r1]
plt.bar(r1, diesel_sales_2023, color='blue', width=bar_width, edgecolor='grey', label='Diesel')
plt.bar(r2, unleaded_sales_2023, color='orange', width=bar_width, edgecolor='grey', label='Unleaded')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total Diesel and Unleaded Sales on Each Holiday in 2023')
plt.xticks([r + bar_width/2 for r in r1], holidays_2023, rotation=90)
plt.legend()
plt.show()
The above graphs shows that the sale of Unleaded is more than diesel on the Holidays.

4.1.7 Plotting in store and food sales on Holidays for the year 2021,2022 and 2023

In [111]:
#2021
t_data_2021 = t_data[t_data['calendar_year'] == 2021]
sales_by_holiday_2021 = t_data_2021.groupby('calendar_information.holiday')[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()
sales_by_holiday_2021 = sales_by_holiday_2021[sales_by_holiday_2021.index != 'NONE']
holidays_2021 = sales_by_holiday_2021.index.tolist()
inStore_sales_2021 = sales_by_holiday_2021['daily_yoy_ndt.total_inside_sales'].tolist()
Food_sales_2021 = sales_by_holiday_2021['daily_yoy_ndt.total_food_service'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2021))
r2 = [x + bar_width for x in r1]
plt.bar(r1, inStore_sales_2021, color='blue', width=bar_width, edgecolor='grey', label='InStore')
plt.bar(r2, Food_sales_2021, color='orange', width=bar_width, edgecolor='grey', label='Food')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total InStore and Food Sales on Each Holiday in 2021')
plt.xticks([r + bar_width/2 for r in r1], holidays_2021, rotation=90)
plt.legend()
plt.show()

#2022
t_data_2022 = t_data[t_data['calendar_year'] == 2022]
sales_by_holiday_2022 = t_data_2022.groupby('calendar_information.holiday')[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()
sales_by_holiday_2022 = sales_by_holiday_2022[sales_by_holiday_2022.index != 'NONE']
holidays_2022 = sales_by_holiday_2022.index.tolist()
inStore_sales_2022 = sales_by_holiday_2022['daily_yoy_ndt.total_inside_sales'].tolist()
Food_sales_2022 = sales_by_holiday_2022['daily_yoy_ndt.total_food_service'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2022))
r2 = [x + bar_width for x in r1]
plt.bar(r1, inStore_sales_2022, color='blue', width=bar_width, edgecolor='grey', label='InStore')
plt.bar(r2, Food_sales_2022, color='orange', width=bar_width, edgecolor='grey', label='Food')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total InStore and Food Sales on Each Holiday in 2022')
plt.xticks([r + bar_width/2 for r in r1], holidays_2022, rotation=90)
plt.legend()
plt.show()

#2023
t_data_2023 = t_data[t_data['calendar_year'] == 2023]
sales_by_holiday_2023 = t_data_2023.groupby('calendar_information.holiday')[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum()
sales_by_holiday_2023 = sales_by_holiday_2023[sales_by_holiday_2023.index != 'NONE']
holidays_2023 = sales_by_holiday_2023.index.tolist()
inStore_sales_2023 = sales_by_holiday_2023['daily_yoy_ndt.total_inside_sales'].tolist()
Food_sales_2023 = sales_by_holiday_2023['daily_yoy_ndt.total_food_service'].tolist()

plt.figure(figsize=(15, 6))
bar_width = 0.35
r1 = range(len(holidays_2023))
r2 = [x + bar_width for x in r1]
plt.bar(r1, inStore_sales_2023, color='blue', width=bar_width, edgecolor='grey', label='InStore')
plt.bar(r2, Food_sales_2023, color='orange', width=bar_width, edgecolor='grey', label='Food')
plt.xlabel('Holiday')
plt.ylabel('Total Sales')
plt.title('Total InStore and Food Sales on Each Holiday in 2021')
plt.xticks([r + bar_width/2 for r in r1], holidays_2023, rotation=90)
plt.legend()
plt.show()
The above graphs shows that the sale of In-store merchandise is more than the sale of food during Holidays.

4.1.8 Plotting In Store and food sales of Maverik store during each month of the year 2021,2022 and 2023

In [112]:
average_sales_2021 = t_data_2021.groupby('calendar_month')['daily_yoy_ndt.total_inside_sales'].mean()
average_sales_2022 = t_data_2022.groupby('calendar_month')['daily_yoy_ndt.total_inside_sales'].mean()
average_sales_2023 = t_data_2023.groupby('calendar_month')['daily_yoy_ndt.total_inside_sales'].mean()
months = range(1,13)
months_2023 = range(1, 9)

plt.figure(figsize=(10, 6))

plt.plot(months, average_sales_2021, label='2021')
plt.plot(months, average_sales_2022, label='2022')
plt.plot(months_2023, average_sales_2023, label='2023')

plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.title('Average Inside Sales per Month (2021-2023)')
plt.legend()
plt.show()



average_sales_2021 = t_data_2021.groupby('calendar_month')['daily_yoy_ndt.total_food_service'].mean()
average_sales_2022 = t_data_2022.groupby('calendar_month')['daily_yoy_ndt.total_food_service'].mean()
average_sales_2023 = t_data_2023.groupby('calendar_month')['daily_yoy_ndt.total_food_service'].mean()

months_2023 = range(1, 9)

plt.figure(figsize=(10, 6))

plt.plot(months, average_sales_2021, label='2021')
plt.plot(months, average_sales_2022, label='2022')
plt.plot(months_2023, average_sales_2023, label='2023')

plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.title('Average Food Sales per Month (2021-2023)')
plt.legend()
plt.show()
From the graphs depicted above, we can observe that the sale of in-store and food items increase during the end of Spring and whole summer season, it starts dipping during the end of Fall with the minimum sales observed during the last month of the year. We can roughly predict that this year sales are going to dip after Fall.

4.1.9 Plotting Diesel and Unleaded sales of Maverik store during each month of the year 2021,2022 and 2023

In [113]:
average_sales_2021 = t_data_2021.groupby('calendar_month')['diesel'].mean()
average_sales_2022 = t_data_2022.groupby('calendar_month')['diesel'].mean()
average_sales_2023 = t_data_2023.groupby('calendar_month')['diesel'].mean()

months_2023 = range(1, 9)

plt.figure(figsize=(10, 6))

plt.plot(months, average_sales_2021, label='2021')
plt.plot(months, average_sales_2022, label='2022')
plt.plot(months_2023, average_sales_2023, label='2023')

plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.title('Average Diesel Sales per Month (2021-2023)')
plt.legend()
plt.show()



average_sales_2021 = t_data_2021.groupby('calendar_month')['unleaded'].mean()
average_sales_2022 = t_data_2022.groupby('calendar_month')['unleaded'].mean()
average_sales_2023 = t_data_2023.groupby('calendar_month')['unleaded'].mean()

months_2023 = range(1, 9)

plt.figure(figsize=(10, 6))

plt.plot(months, average_sales_2021, label='2021')
plt.plot(months, average_sales_2022, label='2022')
plt.plot(months_2023, average_sales_2023, label='2023')

plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.title('Average unleaded Sales per Month (2021-2023)')
plt.legend()
plt.show()
We can see that year 2023 has the higher average sales of unleaded during July and August compared to the previous years whereas Diesel observed a good rise till July 2023 but there was a sudden dip in its average sales during August 2023.

4.2.0 Inside and Food Sales based on Seasonality

In [114]:
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'


t_data['season'] = t_data['calendar_month'].apply(get_season)


t_data_2021 = t_data[t_data['calendar_year'] == 2021].copy()


t_data_2022 = t_data[t_data['calendar_year'] == 2022].copy()


t_data_2023 = t_data[t_data['calendar_year'] == 2023].copy()

sales_by_season_2021 = t_data_2021.groupby('season')[['daily_yoy_ndt.total_inside_sales']].sum()


sales_by_season_2022 = t_data_2022.groupby('season')[['daily_yoy_ndt.total_inside_sales']].sum()


sales_by_season_2023 = t_data_2023.groupby('season')[['daily_yoy_ndt.total_inside_sales']].sum()


plt.figure(figsize=(18, 5))


plt.subplot(1, 3, 1)
plt.pie(sales_by_season_2021['daily_yoy_ndt.total_inside_sales'], labels=sales_by_season_2021.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Inside Sales by Season in 2021')

plt.subplot(1, 3, 2)
plt.pie(sales_by_season_2022['daily_yoy_ndt.total_inside_sales'], labels=sales_by_season_2022.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Inside Sales by Season in 2022')


plt.subplot(1, 3, 3)
plt.pie(sales_by_season_2023['daily_yoy_ndt.total_inside_sales'], labels=sales_by_season_2023.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Inside Sales by Season in 2023')

plt.tight_layout()


sales_table_2021 = sales_by_season_2021.reset_index()
sales_table_2021.columns = ['Season', 'Total Sales (2021)']


sales_table_2022 = sales_by_season_2022.reset_index()
sales_table_2022.columns = ['Season', 'Total Sales (2022)']


sales_table_2023 = sales_by_season_2023.reset_index()
sales_table_2023.columns = ['Season', 'Total Sales (2023)']


with pd.option_context('display.float_format', '{:,.2f}'.format):
    print("Inside Sales for 2021:")
    display(sales_table_2021)
    print("\n Inside Sales for 2022:")
    display(sales_table_2022)
    print("\n Inside Sales for 2023:")
    display(sales_table_2023)

t_data['season'] = t_data['calendar_month'].apply(get_season)


t_data_2021 = t_data[t_data['calendar_year'] == 2021].copy()

t_data_2022 = t_data[t_data['calendar_year'] == 2022].copy()


t_data_2023 = t_data[t_data['calendar_year'] == 2023].copy()

sales_by_season_2021 = t_data_2021.groupby('season')[['daily_yoy_ndt.total_food_service']].sum()


sales_by_season_2022 = t_data_2022.groupby('season')[['daily_yoy_ndt.total_food_service']].sum()


sales_by_season_2023 = t_data_2023.groupby('season')[['daily_yoy_ndt.total_food_service']].sum()

plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plt.pie(sales_by_season_2021['daily_yoy_ndt.total_food_service'], labels=sales_by_season_2021.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Food Sales by Season in 2021')

plt.subplot(1, 3, 2)
plt.pie(sales_by_season_2022['daily_yoy_ndt.total_food_service'], labels=sales_by_season_2022.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Food Sales by Season in 2022')

plt.subplot(1, 3, 3)
plt.pie(sales_by_season_2023['daily_yoy_ndt.total_food_service'], labels=sales_by_season_2023.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Food Sales by Season in 2023')

plt.tight_layout()


sales_table_2021 = sales_by_season_2021.reset_index()
sales_table_2021.columns = ['Season', 'Total Sales (2021)']


sales_table_2022 = sales_by_season_2022.reset_index()
sales_table_2022.columns = ['Season', 'Total Sales (2022)']

sales_table_2023 = sales_by_season_2023.reset_index()
sales_table_2023.columns = ['Season', 'Total Sales (2023)']

with pd.option_context('display.float_format', '{:,.2f}'.format):
    print("Food Sales for 2021:")
    display(sales_table_2021)
    print("\n Food Sales for 2022:")
    display(sales_table_2022)
    print("\n Food Sales for 2023:")
    display(sales_table_2023)
Inside Sales for 2021:
Season Total Sales (2021)
0 Fall 4,965,353.13
1 Spring 1,479,716.70
2 Summer 3,803,032.47
3 Winter 2,255,542.94
 Inside Sales for 2022:
Season Total Sales (2022)
0 Fall 5,030,273.83
1 Spring 6,130,527.21
2 Summer 6,121,676.82
3 Winter 4,458,846.58
 Inside Sales for 2023:
Season Total Sales (2023)
0 Spring 2,586,886.73
1 Summer 977,866.26
2 Winter 1,779,927.67
Food Sales for 2021:
Season Total Sales (2021)
0 Fall 1,309,059.60
1 Spring 436,452.34
2 Summer 996,153.76
3 Winter 611,970.67
 Food Sales for 2022:
Season Total Sales (2022)
0 Fall 1,401,922.47
1 Spring 1,551,493.74
2 Summer 1,600,273.02
3 Winter 1,177,648.00
 Food Sales for 2023:
Season Total Sales (2023)
0 Spring 719,865.86
1 Summer 265,347.58
2 Winter 498,812.66

4.2.1 Diesel and Unleaded Sales based on Seasonality

In [115]:
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'


t_data['season'] = t_data['calendar_month'].apply(get_season)


t_data_2021 = t_data[t_data['calendar_year'] == 2021].copy()


t_data_2022 = t_data[t_data['calendar_year'] == 2022].copy()


t_data_2023 = t_data[t_data['calendar_year'] == 2023].copy()


sales_by_season_2021 = t_data_2021.groupby('season')[['diesel']].sum()

sales_by_season_2022 = t_data_2022.groupby('season')[['diesel']].sum()

sales_by_season_2023 = t_data_2023.groupby('season')[['diesel']].sum()

plt.figure(figsize=(18, 5))


plt.subplot(1, 3, 1)
plt.pie(sales_by_season_2021['diesel'], labels=sales_by_season_2021.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2021')


plt.subplot(1, 3, 2)
plt.pie(sales_by_season_2022['diesel'], labels=sales_by_season_2022.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2022')

plt.subplot(1, 3, 3)
plt.pie(sales_by_season_2023['diesel'], labels=sales_by_season_2023.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2023')

plt.tight_layout()


sales_table_2021 = sales_by_season_2021.reset_index()
sales_table_2021.columns = ['Season', 'Total Sales (2021)']


sales_table_2022 = sales_by_season_2022.reset_index()
sales_table_2022.columns = ['Season', 'Total Sales (2022)']


sales_table_2023 = sales_by_season_2023.reset_index()
sales_table_2023.columns = ['Season', 'Total Sales (2023)']


with pd.option_context('display.float_format', '{:,.2f}'.format):
    print("Sales for 2021:")
    display(sales_table_2021)
    print("\nSales for 2022:")
    display(sales_table_2022)
    print("\nSales for 2023:")
    display(sales_table_2023)


t_data['season'] = t_data['calendar_month'].apply(get_season)


t_data_2021 = t_data[t_data['calendar_year'] == 2021].copy()


t_data_2022 = t_data[t_data['calendar_year'] == 2022].copy()


t_data_2023 = t_data[t_data['calendar_year'] == 2023].copy()


sales_by_season_2021 = t_data_2021.groupby('season')[['unleaded']].sum()


sales_by_season_2022 = t_data_2022.groupby('season')[['unleaded']].sum()


sales_by_season_2023 = t_data_2023.groupby('season')[['unleaded']].sum()


plt.figure(figsize=(18, 5))


plt.subplot(1, 3, 1)
plt.pie(sales_by_season_2021['unleaded'], labels=sales_by_season_2021.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2021')

plt.subplot(1, 3, 2)
plt.pie(sales_by_season_2022['unleaded'], labels=sales_by_season_2022.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2022')


plt.subplot(1, 3, 3)
plt.pie(sales_by_season_2023['unleaded'], labels=sales_by_season_2023.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#99ff99','#ffcc99'])
plt.title('Distribution of Sales by Season in 2023')

plt.tight_layout()


sales_table_2021 = sales_by_season_2021.reset_index()
sales_table_2021.columns = ['Season', 'Total Sales (2021)']


sales_table_2022 = sales_by_season_2022.reset_index()
sales_table_2022.columns = ['Season', 'Total Sales (2022)']


sales_table_2023 = sales_by_season_2023.reset_index()
sales_table_2023.columns = ['Season', 'Total Sales (2023)']


with pd.option_context('display.float_format', '{:,.2f}'.format):
    print("Sales for 2021:")
    display(sales_table_2021)
    print("\nSales for 2022:")
    display(sales_table_2022)
    print("\nSales for 2023:")
    display(sales_table_2023)
Sales for 2021:
Season Total Sales (2021)
0 Fall 2,650,068.48
1 Spring 728,384.00
2 Summer 2,023,123.50
3 Winter 1,165,056.13
Sales for 2022:
Season Total Sales (2022)
0 Fall 3,590,740.64
1 Spring 3,743,812.36
2 Summer 3,859,752.97
3 Winter 2,790,836.70
Sales for 2023:
Season Total Sales (2023)
0 Spring 1,491,919.96
1 Summer 656,316.77
2 Winter 979,543.83
Sales for 2021:
Season Total Sales (2021)
0 Fall 4,275,515.47
1 Spring 1,055,160.85
2 Summer 2,703,529.66
3 Winter 2,056,975.09
Sales for 2022:
Season Total Sales (2022)
0 Fall 4,023,479.15
1 Spring 5,619,266.38
2 Summer 5,260,911.29
3 Winter 4,157,802.81
Sales for 2023:
Season Total Sales (2023)
0 Spring 1,904,366.26
1 Summer 715,509.25
2 Winter 1,357,613.57

4.2.2 Total sale of target variables for all site id's

In [116]:
grouped_data = t_data.groupby('site_id_msba').agg({
    'daily_yoy_ndt.total_inside_sales': 'sum',
    'daily_yoy_ndt.total_food_service': 'sum',
    'diesel': 'sum',
    'unleaded': 'sum'
}).reset_index()

grouped_data.set_index('site_id_msba', inplace=True)


ax = grouped_data.plot(kind='bar', figsize=(12, 8), width=0.8)
plt.xlabel('site_id_msba')
plt.ylabel('Total Sum')
plt.title('Total Sum of Sales and Fuel Types by site_id_msba')
plt.legend(title='Categories')


plt.tight_layout()
plt.show()

The above graph and table shows that site id 22085 recorded maximum sale of Merchandise and food service, site id 21980 recorded the maximum sale of Diesel and Site id 22260 recorded the maximum sale od Unleaded.

5. Merging both dataframes

In [117]:
df_merged=pd.merge(t_data,q_data, on='site_id_msba', how='left')
In [118]:
df_merged.shape
Out[118]:
(13908, 60)
In [119]:
df_merged.head()
Out[119]:
capital_projects.soft_opening_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba ... hi_flow_lanes_layout hi_flow_lanes_stack_type hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count
0 2022-06-14 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535 ... 1.0 0.0 5.0 4.0 1.0 0.0 1.0 1.0 2.0 2.0
1 2022-06-14 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535 ... 1.0 0.0 5.0 4.0 1.0 0.0 1.0 1.0 2.0 2.0
2 2022-06-14 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535 ... 1.0 0.0 5.0 4.0 1.0 0.0 1.0 1.0 2.0 2.0
3 2022-06-14 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535 ... 1.0 0.0 5.0 4.0 1.0 0.0 1.0 1.0 2.0 2.0
4 2022-06-14 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535 ... 1.0 0.0 5.0 4.0 1.0 0.0 1.0 1.0 2.0 2.0

5 rows × 60 columns

5.1 Relation between customer population and income

In [120]:
plt.scatter(df_merged['x1_mile_pop'], df_merged['x1_mile_income'])
plt.xlabel('Population within 1-mile radius')
plt.ylabel('Median income of 1-mile radius population')
plt.title('Scatter Plot: 1-mile Population vs. 1-mile Income')
plt.show()

plt.scatter(df_merged['x1_2_mile_pop'], df_merged['x1_2_mile_income'])
plt.xlabel('Population within 1/2-mile radius')
plt.ylabel('Median income of 1/2-mile radius population')
plt.title('Scatter Plot: 1/2-mile Population vs. 1/2-mile Income')
plt.show()

plt.scatter(df_merged['x5_min_pop'], df_merged['x5_min_inc'])
plt.xlabel('Population within 5-minute radius')
plt.ylabel('Median income of 5-minute radius population')
plt.title('Scatter Plot: 5-minute Population vs. 5-minute Income')
plt.show()

plt.scatter(df_merged['x7_min_pop'], df_merged['x7_min_inc'])
plt.xlabel('Population within 7-minute radius')
plt.ylabel('Median income of 7-minute radius population')
plt.title('Scatter Plot: 7-minute Population vs. 7-minute Income')
plt.show()

5.2 Population in Miles vs Total Sales

In [121]:
df_merged['total_sales'] = df_merged['daily_yoy_ndt.total_food_service'] + df_merged['daily_yoy_ndt.total_inside_sales']
total_sales_1_mile = df_merged[df_merged['x1_mile_pop'] > 0]['total_sales'].sum()
total_sales_2_mile = df_merged[df_merged['x1_2_mile_pop'] > 0]['total_sales'].sum()
total_sales_5_min = df_merged[df_merged['x5_min_pop'] > 0]['total_sales'].sum()
total_sales_7_min = df_merged[df_merged['x7_min_pop'] > 0]['total_sales'].sum()


labels = ['1 Mile', '2 Miles', '5 Minutes', '7 Minutes']
values = [total_sales_1_mile, total_sales_2_mile, total_sales_5_min, total_sales_7_min]

plt.figure(figsize=(8, 8))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Contribution of Total Sales by Population Radius')
plt.show()

5.3 Customer Income vs Total Sales

In [122]:
income_bins = [20000, 30000, 50000, 70000, 90000, 110000, 1000000]
income_labels = ['20k-30k', '30k-50k', '50k-70k', '70k-90k', '90k-110k', '100k+']

df_merged['x1_mile_income_bin'] = pd.cut(df_merged['x1_mile_income'], bins=income_bins, labels=income_labels, right=False)
df_merged['x1_2_mile_income_bin'] = pd.cut(df_merged['x1_2_mile_income'], bins=income_bins, labels=income_labels, right=False)
df_merged['x5_min_inc_bin'] = pd.cut(df_merged['x5_min_inc'], bins=income_bins, labels=income_labels, right=False)
df_merged['x7_min_inc_bin'] = pd.cut(df_merged['x7_min_inc'], bins=income_bins, labels=income_labels, right=False)

df_merged['total_sales'] = df_merged['daily_yoy_ndt.total_inside_sales'] + df_merged['daily_yoy_ndt.total_food_service']


total_sales_by_income_x1_mile = df_merged.groupby('x1_mile_income_bin')['total_sales'].sum()
total_sales_by_income_x1_2_mile = df_merged.groupby('x1_2_mile_income_bin')['total_sales'].sum()
total_sales_by_income_x5_min = df_merged.groupby('x5_min_inc_bin')['total_sales'].sum()
total_sales_by_income_x7_min = df_merged.groupby('x7_min_inc_bin')['total_sales'].sum()

plt.figure(figsize=(12, 8))


plt.subplot(221)
total_sales_by_income_x1_mile.plot(kind='bar')
plt.title('Total Sales by x1_mile_income_bin')
plt.xlabel('Income Range')
plt.ylabel('Total Sales')


plt.subplot(222)
total_sales_by_income_x1_2_mile.plot(kind='bar')
plt.title('Total Sales by x1_2_mile_income_bin')
plt.xlabel('Income Range')
plt.ylabel('Total Sales')


plt.subplot(223)
total_sales_by_income_x5_min.plot(kind='bar')
plt.title('Total Sales by x5_min_inc_bin')
plt.xlabel('Income Range')
plt.ylabel('Total Sales')


plt.subplot(224)
total_sales_by_income_x7_min.plot(kind='bar')
plt.title('Total Sales by x7_min_inc_bin')
plt.xlabel('Income Range')
plt.ylabel('Total Sales')

plt.tight_layout()
plt.show()
Customers with income range between 30k to 70k contribute majorly towards total sales

5.4 Distribution of Sales based on Stores that Sell "freal," "bonfire_grill," "pizza," and "cinnabon"

In [123]:
sells_items = df_merged[(df_merged['freal'] > 0) | (df_merged['bonfire_grill'] > 0) | (df_merged['pizza'] > 0) | (df_merged['cinnabon'] > 0)]
does_not_sell_items = df_merged[(df_merged['freal'] == 0) & (df_merged['bonfire_grill'] == 0) & (df_merged['pizza'] == 0) & (df_merged['cinnabon'] == 0)]


avg_food_service_sells_items = sells_items['daily_yoy_ndt.total_food_service'].mean()
avg_food_service_does_not_sell_items = does_not_sell_items['daily_yoy_ndt.total_food_service'].mean()

store_type_counts = [len(sells_items), len(does_not_sell_items)]
labels = ['Sells Items', 'Does Not Sell Items']
plt.figure(figsize=(8, 6))
plt.pie(store_type_counts, labels=labels, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Store Types')
plt.show()
We can see that 70.3% of the food sales come from stores that sell food items and 29.7% from those that do not sell food items

5.5 Comparing Sale of Diesel and Unleaded based on Fuel Station Features

In [124]:
columns_to_visualize = ['cat_scales', 'propane', 'traditional_forecourt_fueling_positions',
                        'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
                        'hi_flow_rv_lanes_layout', 'hi_flow_rv_lanes_stack_type']

for column in columns_to_visualize:

    sales_by_column = df_merged.groupby(column)[['diesel', 'unleaded']].sum()
    sales_by_column.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Sales of Diesel and Unleaded by {column}')
    plt.xlabel(column)
    plt.ylabel('Total Sales')
    plt.xticks(rotation=0)
    plt.legend(['Diesel', 'Unleaded'])
    plt.show()
In [125]:
### comment to be added by Shivi

5.6 Comparing Sale of Diesel based on Fuel Station Features

In [126]:
numerical_columns = ['hi_flow_lanes', 'hi_flow_rv_lanes', 'def',
                     'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_layout',
                     'hi_flow_lanes_stack_type', 'diesel']

subset_df = df_merged[numerical_columns]

correlation_matrix = subset_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
We observe that sale of diesel increases if there is an increase in the number of hi flow lanes fueling positions.

5.7 Comparing Sale of Unleaded based on Fuel Station Features

In [127]:
numerical_columns =  [ 'ethanol_free', 'rv_lanes', 'rv_dumps',
                        'rv_lanes_fueling_positions', 'rv_lanes_layout',
                        'rv_lanes_stack_type', 'unleaded']

subset_df = df_merged[numerical_columns]


correlation_matrix = subset_df.corr()


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
As reflected in the heat map, all the predictors except rv_dumps has a mild positive correlation where an increase in one unit of the value will lead to the increase in sale of unleaded

Dropping the columns not required during Modeling Phase

In [128]:
columns_to_drop = ['x1_mile_income_bin', 'x1_2_mile_income_bin', 'x5_min_inc_bin', 'x7_min_inc_bin']
df_merged = df_merged.drop(columns=columns_to_drop)

6.1 Feature Engineering

In [129]:
df_merged['calendar.calendar_day_date'] = t_data.index
In [130]:
df_merged.set_index('calendar.calendar_day_date', inplace=True)

One Hot Encoding

In [131]:
# Identifying categorical columns
categorical_columns = df_merged.select_dtypes(include=['object']).columns

# Applying one-hot encoding to categorical columns
encoded_df = pd.get_dummies(df_merged, columns=categorical_columns)
In [132]:
encoded_df.head()
Out[132]:
calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba calendar_year calendar_month calendar_day open_year ... calendar_information.holiday_Saint Valentine's Day calendar_information.holiday_Thanksgiving Day calendar_information.holiday_Veteran's Day calendar_information.holiday_Washington's Birthday calendar_information.type_of_day_WEEKDAY calendar_information.type_of_day_WEEKEND season_Fall season_Spring season_Summer season_Winter
calendar.calendar_day_date
2022-06-17 25 2168.2920 861.6930 722.7745 1425.1020 24535 2022 6 17 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-22 25 2051.5635 808.0275 730.4850 1436.2740 24535 2022 6 22 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-23 25 2257.5000 966.4410 895.7970 1594.3725 24535 2022 6 23 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-26 26 1520.5925 542.3250 584.2900 1124.9280 24535 2022 6 26 2022.0 ... 0 0 0 0 0 1 0 0 1 0
2022-06-27 26 1897.6930 771.4525 852.2605 1640.2540 24535 2022 6 27 2022.0 ... 0 0 0 0 1 0 0 0 1 0

5 rows × 127 columns

'encoded_df' now contains the original columns along with one-hot encoded categorical columns

3. Modeling process

I. XGBoost Model

A.Target Variable = Food

Plotting the graph for sale of food over the years

In [133]:
average_sales = encoded_df['daily_yoy_ndt.total_food_service'].mean()
encoded_df['daily_yoy_ndt.total_food_service'].plot(style=".", figsize=(15,5), title="Food Sale over the Years")
plt.axhline(y=average_sales, color='r', linestyle='-', label='Average Sales')
plt.legend()
plt.show()

Train Test Split

In [134]:
#splitting train as date less than 1st jan 2023 and test as later

train= encoded_df['daily_yoy_ndt.total_food_service'].loc[encoded_df.index< '2023-01-01']
test= encoded_df['daily_yoy_ndt.total_food_service'].loc[encoded_df.index>= '2023-01-01']
fig, ax= plt.subplots(figsize=(15,5))
train.plot(ax=ax, label='training set')
test.plot(ax=ax, label='Test set')
ax.axvline('2023-01-01',color='black',ls='--')
ax.legend(['Training Set','Test Set'])
plt.show()

Splitting data into training and testing sets

In [135]:
# Features (X) and Target (y)
X = encoded_df.drop(columns=['daily_yoy_ndt.total_food_service'])
y = encoded_df['daily_yoy_ndt.total_food_service']

train_end_date = '2023-01-01'
X_train = X[X.index < train_end_date]
X_test = X[X.index >= train_end_date]
y_train = y[y.index < train_end_date]
y_test = y[y.index >= train_end_date]

We now have X_train, X_test, y_train, and y_test ready for modeling

In [136]:
reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                       n_estimators=1000,
                       early_stopping_rounds=50,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)
[21:55:04] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:822.33871	validation_1-rmse:839.36357
[100]	validation_0-rmse:317.56040	validation_1-rmse:339.52415
[200]	validation_0-rmse:140.56754	validation_1-rmse:151.60532
[300]	validation_0-rmse:84.29154	validation_1-rmse:87.95342
[400]	validation_0-rmse:68.09734	validation_1-rmse:69.79305
[500]	validation_0-rmse:62.11972	validation_1-rmse:67.61086
[581]	validation_0-rmse:59.54191	validation_1-rmse:68.03542
Out[136]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=50, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, objective='reg:linear',
             predictor='auto', random_state=0, reg_alpha=0, ...)

Identifying Top 10 features impacting the Food Sale

In [137]:
fi = pd.DataFrame(data=reg.feature_importances_,
             index=reg.feature_names_in_,
             columns=['importance'])
top_10_features = fi.sort_values(by='importance', ascending=False).head(10)
top_10_features.plot(kind='barh', title='Top 10 Feature Importance')
Out[137]:
<matplotlib.axes._subplots.AxesSubplot at 0x231919a0630>

Plotting the graph to depict Train and Test Split

In [138]:
train = encoded_df[['daily_yoy_ndt.total_food_service','x1_2_mile_pop','x7_min_pop','x1_mile_pop','womens_sink_count','hi_flow_lanes','square_feet','x1_2_mile_emp','x5_min_pop']].loc[encoded_df.index < '2023-01-01']
test = encoded_df[['daily_yoy_ndt.total_food_service','x1_2_mile_pop','x7_min_pop','x1_mile_pop','womens_sink_count','hi_flow_lanes','square_feet','x1_2_mile_emp','x5_min_pop']].loc[encoded_df.index >= '2023-01-01']

fig, ax = plt.subplots(figsize=(15, 5))
train['daily_yoy_ndt.total_food_service'].plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test['daily_yoy_ndt.total_food_service'].plot(ax=ax, label='Test Set')
ax.axvline('2023-01-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Time series cross validation

In [139]:
tss = TimeSeriesSplit(n_splits=5)
df = encoded_df[['daily_yoy_ndt.total_food_service','x1_2_mile_pop','x7_min_pop','x1_mile_pop','womens_sink_count','hi_flow_lanes','square_feet','x1_2_mile_emp','x5_min_pop']].sort_index()
In [140]:
fig, axs = plt.subplots(5, 1, figsize=(15, 15), sharex=True)

fold = 0
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train['daily_yoy_ndt.total_food_service'].plot(ax=axs[fold],
                          label='Training Set',
                          title=f'Data Train/Test Split Fold {fold}')
    test['daily_yoy_ndt.total_food_service'].plot(ax=axs[fold],
                         label='Test Set')
    axs[fold].axvline(test.index.min(), color='black', ls='--')
    fold += 1
plt.show()

Creating time series features based on time series index

In [141]:
def create_features(df):
   
    df = df.copy()
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    return df

df = create_features(df)
In [142]:
df.head()
Out[142]:
daily_yoy_ndt.total_food_service x1_2_mile_pop x7_min_pop x1_mile_pop womens_sink_count hi_flow_lanes square_feet x1_2_mile_emp x5_min_pop dayofweek quarter month year dayofyear dayofmonth
calendar.calendar_day_date
2021-01-12 762.8530 556.0 13895.0 4046.0 2.0 0.0 5046.0 642.0 4776.0 1 1 1 2021 12 12
2021-01-13 1003.7930 556.0 13895.0 4046.0 2.0 0.0 5046.0 642.0 4776.0 2 1 1 2021 13 13
2021-01-14 974.2250 556.0 13895.0 4046.0 2.0 0.0 5046.0 642.0 4776.0 3 1 1 2021 14 14
2021-01-15 911.0115 556.0 13895.0 4046.0 2.0 0.0 5046.0 642.0 4776.0 4 1 1 2021 15 15
2021-01-16 715.7535 556.0 13895.0 4046.0 2.0 0.0 5046.0 642.0 4776.0 5 1 1 2021 16 16

Defining Lags

In [143]:
def add_lags(df):
  target_map= df['daily_yoy_ndt.total_food_service'].to_dict()
  df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
  return df
In [144]:
df = add_lags(df)
In [145]:
df.tail()
Out[145]:
daily_yoy_ndt.total_food_service x1_2_mile_pop x7_min_pop x1_mile_pop womens_sink_count hi_flow_lanes square_feet x1_2_mile_emp x5_min_pop dayofweek quarter month year dayofyear dayofmonth lag1
calendar.calendar_day_date
2023-08-12 843.5980 1003.0 26142.0 3629.0 2.0 1.0 2933.0 656.0 15721.0 5 3 8 2023 224 12 427.9555
2023-08-13 719.3655 1003.0 26142.0 3629.0 2.0 1.0 2933.0 656.0 15721.0 6 3 8 2023 225 13 719.5720
2023-08-14 772.2015 1003.0 26142.0 3629.0 2.0 1.0 2933.0 656.0 15721.0 0 3 8 2023 226 14 1185.7580
2023-08-15 843.6260 1003.0 26142.0 3629.0 2.0 1.0 2933.0 656.0 15721.0 1 3 8 2023 227 15 1023.3580
2023-08-16 849.2155 1003.0 26142.0 3629.0 2.0 1.0 2933.0 656.0 15721.0 2 3 8 2023 228 16 1043.0910

Fitting the Model to calculate Evaluation Metrics

In [146]:
tss = TimeSeriesSplit(n_splits=5)
df = df.sort_index()


fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]


    TARGET = 'daily_yoy_ndt.total_food_service'

    X_train = train
    y_train = train[TARGET]

    X_test = test
    y_test = test[TARGET]

    reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                           n_estimators=1000,
                           early_stopping_rounds=50,
                           objective='reg:linear',
                           max_depth=3,
                           learning_rate=0.01)
    reg.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=100)

    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)
[21:56:42] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:900.37882	validation_1-rmse:769.03607
[100]	validation_0-rmse:333.72987	validation_1-rmse:291.60915
[200]	validation_0-rmse:124.22276	validation_1-rmse:114.34814
[300]	validation_0-rmse:46.66751	validation_1-rmse:49.10119
[400]	validation_0-rmse:17.79741	validation_1-rmse:25.39983
[500]	validation_0-rmse:7.02936	validation_1-rmse:16.91861
[600]	validation_0-rmse:2.99848	validation_1-rmse:13.20890
[700]	validation_0-rmse:1.46336	validation_1-rmse:11.38510
[800]	validation_0-rmse:0.81716	validation_1-rmse:10.39009
[900]	validation_0-rmse:0.50088	validation_1-rmse:9.82688
[999]	validation_0-rmse:0.33851	validation_1-rmse:9.50311
[21:56:47] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:837.20365	validation_1-rmse:725.44117
[100]	validation_0-rmse:309.95300	validation_1-rmse:267.62648
[200]	validation_0-rmse:115.26811	validation_1-rmse:99.66830
[300]	validation_0-rmse:43.26500	validation_1-rmse:38.71476
[400]	validation_0-rmse:16.50061	validation_1-rmse:16.92296
[500]	validation_0-rmse:6.47389	validation_1-rmse:9.10330
[600]	validation_0-rmse:2.67423	validation_1-rmse:5.93457
[700]	validation_0-rmse:1.22950	validation_1-rmse:4.42796
[800]	validation_0-rmse:0.67823	validation_1-rmse:3.54941
[900]	validation_0-rmse:0.43899	validation_1-rmse:3.02571
[999]	validation_0-rmse:0.33011	validation_1-rmse:2.71537
[21:56:54] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:801.68127	validation_1-rmse:826.67025
[100]	validation_0-rmse:296.62474	validation_1-rmse:305.02969
[200]	validation_0-rmse:110.36454	validation_1-rmse:112.37237
[300]	validation_0-rmse:41.42941	validation_1-rmse:41.49164
[400]	validation_0-rmse:15.73554	validation_1-rmse:15.43007
[500]	validation_0-rmse:6.09038	validation_1-rmse:5.78545
[600]	validation_0-rmse:2.43797	validation_1-rmse:2.24685
[700]	validation_0-rmse:1.06120	validation_1-rmse:1.01430
[800]	validation_0-rmse:0.55955	validation_1-rmse:0.67326
[900]	validation_0-rmse:0.36893	validation_1-rmse:0.59696
[999]	validation_0-rmse:0.29946	validation_1-rmse:0.58065
[21:57:02] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:807.99490	validation_1-rmse:884.13938
[100]	validation_0-rmse:298.62104	validation_1-rmse:326.05306
[200]	validation_0-rmse:110.91692	validation_1-rmse:120.04271
[300]	validation_0-rmse:41.54840	validation_1-rmse:44.24722
[400]	validation_0-rmse:15.74063	validation_1-rmse:16.48211
[500]	validation_0-rmse:6.06862	validation_1-rmse:6.21427
[600]	validation_0-rmse:2.41635	validation_1-rmse:2.41230
[700]	validation_0-rmse:1.03831	validation_1-rmse:1.08227
[800]	validation_0-rmse:0.54864	validation_1-rmse:0.71190
[900]	validation_0-rmse:0.37019	validation_1-rmse:0.63380
[999]	validation_0-rmse:0.30829	validation_1-rmse:0.61911
[21:57:09] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:823.77780	validation_1-rmse:827.65419
[100]	validation_0-rmse:304.23759	validation_1-rmse:304.14721
[200]	validation_0-rmse:112.89267	validation_1-rmse:111.89526
[300]	validation_0-rmse:42.22505	validation_1-rmse:41.23763
[400]	validation_0-rmse:15.95947	validation_1-rmse:15.30038
[500]	validation_0-rmse:6.13029	validation_1-rmse:5.72500
[600]	validation_0-rmse:2.42964	validation_1-rmse:2.18086
[700]	validation_0-rmse:1.03651	validation_1-rmse:0.89595
[800]	validation_0-rmse:0.54831	validation_1-rmse:0.49164
[900]	validation_0-rmse:0.37871	validation_1-rmse:0.38983
[999]	validation_0-rmse:0.32264	validation_1-rmse:0.36799

Calculation of evaluation metrics

In [147]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2) Score: {r2:.2f}')
Mean Absolute Error (MAE): 0.27
Mean Squared Error (MSE): 0.14
R-squared (R2) Score: 1.00
In [148]:
print(f'Score across folds {np.mean(scores):0.4f}')
print(f'Fold scores:{scores}')
Score across folds 2.7572
Fold scores:[9.503109583932252, 2.715367362716839, 0.580649666481034, 0.6191068521879374, 0.3679864222976111]

Retraining all data

In [149]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
df['isFuture'] = False

TARGET = 'daily_yoy_ndt.total_food_service'

# Define features (excluding the target and 'isFuture' column)
FEATURES = ['x1_2_mile_pop','x7_min_pop','x1_mile_pop','womens_sink_count','hi_flow_lanes','square_feet','x1_2_mile_emp','x5_min_pop',
            'dayofweek', 'quarter', 'month', 'year', 'dayofyear',
            'dayofmonth', 'lag1','isFuture']

# Split the data into features (X) and target (y)
X = df[FEATURES]
y = df[TARGET]

# Create the XGBoost model
reg = xgb.XGBRegressor(base_score=0.5,
                       booster='gbtree',
                       n_estimators=500,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)

# Fit the model
reg.fit(X, y, eval_set=[(X, y)], verbose=100)
[21:57:17] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:824.88453
[100]	validation_0-rmse:354.95450
[200]	validation_0-rmse:214.38246
[300]	validation_0-rmse:175.43089
[400]	validation_0-rmse:160.60600
[499]	validation_0-rmse:151.34876
Out[149]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, objective='reg:linear', predictor='auto',
             random_state=0, reg_alpha=0, ...)

Creating Future Dataframe

In [150]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
In [151]:
future_w_features = df_and_future.query('isFuture').copy()
In [152]:
future_w_features.head()
Out[152]:
daily_yoy_ndt.total_food_service x1_2_mile_pop x7_min_pop x1_mile_pop womens_sink_count hi_flow_lanes square_feet x1_2_mile_emp x5_min_pop dayofweek quarter month year dayofyear dayofmonth lag1 isFuture
2023-08-17 NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 3 8 2023 229 17 1078.1330 True
2023-08-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 3 8 2023 230 18 697.9175 True
2023-08-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 3 8 2023 231 19 531.1845 True
2023-08-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 3 8 2023 232 20 366.8105 True
2023-08-21 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 3 8 2023 233 21 1395.0475 True
In [153]:
future_w_features = future_w_features.drop('daily_yoy_ndt.total_food_service', axis=1)
In [154]:
future_w_features['pred'] = reg.predict(future_w_features)
In [155]:
future_w_features.index = pd.to_datetime(future_w_features.index)  # Make sure index is datetime
plt.figure(figsize=(10, 5))
plt.plot(future_w_features.index, future_w_features['pred'], color='r', marker='o', linestyle='-', linewidth=1, markersize=3)
plt.title('Future Predictions For Food Sales ')
plt.xlabel('Date')
plt.ylabel('Inside Sales')
plt.grid(True)
plt.show()

Combining Future Prediction with the Historical Data

In [156]:
# Plotting historical data
plt.figure(figsize=(15, 5))

# Plot the entire historical data
encoded_df['daily_yoy_ndt.total_food_service'].plot(label='Historical Data', color='b')
plt.axvline('2023-08-17', color='black', ls='--')
# Plot future predictions
plt.plot(future_w_features.index, future_w_features['pred'], label='Future Predictions', color='red', linestyle='dashed')

# Set title and labels
plt.title('Historical Data and Future Predictions of Total Food Sales')
plt.xlabel('Date')
plt.ylabel('Value')

# Add legend
plt.legend()

# Show plot
plt.show()

Displaying the data for future predictions

In [157]:
start_date = '2023-08-17'
end_date = pd.to_datetime(start_date) + pd.DateOffset(days=502)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')


# Create a DataFrame for the forecasted sales
forecasted_sales_df = pd.DataFrame({'Date': date_range, 'Sales': future_w_features['pred']})

# Display the DataFrame
print(forecasted_sales_df)
                 Date       Sales
2023-08-17 2023-08-17  575.633667
2023-08-18 2023-08-18  575.633667
2023-08-19 2023-08-19  461.064240
2023-08-20 2023-08-20  362.186798
2023-08-21 2023-08-21  544.258667
...               ...         ...
2024-12-27 2024-12-27  458.436890
2024-12-28 2024-12-28  366.741882
2024-12-29 2024-12-29  271.028229
2024-12-30 2024-12-30  453.466095
2024-12-31 2024-12-31  456.233978

[503 rows x 2 columns]

B.Target Variable = Inside Sales

Plotting the graph for sale of food over the years

In [158]:
average_sales = encoded_df['daily_yoy_ndt.total_inside_sales'].mean()
encoded_df['daily_yoy_ndt.total_inside_sales'].plot(style=".", figsize=(15,5), title="Diesel Sale over the Years")
plt.axhline(y=average_sales, color='r', linestyle='-', label='Average Sales')
plt.legend()
plt.show()

Train Test Split

In [159]:
train= encoded_df['daily_yoy_ndt.total_inside_sales'].loc[encoded_df.index< '2023-01-01']
test= encoded_df['daily_yoy_ndt.total_inside_sales'].loc[encoded_df.index>= '2023-01-01']
fig, ax= plt.subplots(figsize=(15,5))
train.plot(ax=ax, label='training set')
test.plot(ax=ax, label='Test set')
ax.axvline('2023-01-01',color='black',ls='--')
ax.legend(['Training Set','Test Set'])
plt.show()

Splitting data into training and testing sets

In [160]:
# Features (X) and Target (y)
X = encoded_df.drop(columns=['daily_yoy_ndt.total_inside_sales'])
y = encoded_df['daily_yoy_ndt.total_inside_sales']

# Split data into training and testing sets
train_end_date = '2023-01-01'
X_train = X[X.index < train_end_date]
X_test = X[X.index >= train_end_date]
y_train = y[y.index < train_end_date]
y_test = y[y.index >= train_end_date]

We now have X_train, X_test, y_train, and y_test ready for modeling

In [161]:
reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                       n_estimators=1000,
                       early_stopping_rounds=50,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)
[21:57:24] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2983.10463	validation_1-rmse:2963.56706
[100]	validation_0-rmse:1104.08629	validation_1-rmse:1074.80152
[200]	validation_0-rmse:420.99665	validation_1-rmse:398.57480
[300]	validation_0-rmse:177.21011	validation_1-rmse:167.60036
[400]	validation_0-rmse:96.25162	validation_1-rmse:94.05203
[500]	validation_0-rmse:70.59861	validation_1-rmse:71.34147
[600]	validation_0-rmse:59.60515	validation_1-rmse:61.46005
[700]	validation_0-rmse:53.54319	validation_1-rmse:56.74476
[800]	validation_0-rmse:48.92327	validation_1-rmse:53.30744
[900]	validation_0-rmse:44.86184	validation_1-rmse:51.36494
[999]	validation_0-rmse:41.56215	validation_1-rmse:49.58048
Out[161]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=50, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, objective='reg:linear',
             predictor='auto', random_state=0, reg_alpha=0, ...)

Identifying Top 10 features impacting the Inside Sale

In [162]:
fi = pd.DataFrame(data=reg.feature_importances_,
             index=reg.feature_names_in_,
             columns=['importance'])
top_10_features = fi.sort_values(by='importance', ascending=False).head(10)
top_10_features.plot(kind='barh', title='Top 10 Feature Importance')
Out[162]:
<matplotlib.axes._subplots.AxesSubplot at 0x2318ff1ba90>

Plotting the graph to depict Train and Test Split

In [163]:
train = encoded_df[['daily_yoy_ndt.total_inside_sales','x5_min_pop','x1_2_mile_emp','square_feet','hi_flow_lanes','womens_sink_count','x1_mile_pop','x7_min_pop','x1_2_mile_pop']].loc[encoded_df.index < '2023-01-01']
test = encoded_df[['daily_yoy_ndt.total_inside_sales','x5_min_pop','x1_2_mile_emp','square_feet','hi_flow_lanes','womens_sink_count','x1_mile_pop','x7_min_pop','x1_2_mile_pop']].loc[encoded_df.index >= '2023-01-01']

fig, ax = plt.subplots(figsize=(15, 5))
train['daily_yoy_ndt.total_inside_sales'].plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test['daily_yoy_ndt.total_inside_sales'].plot(ax=ax, label='Test Set')
ax.axvline('2023-01-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Time series cross validation

In [164]:
tss = TimeSeriesSplit(n_splits=5)
df = encoded_df[['daily_yoy_ndt.total_inside_sales','x5_min_pop','x1_2_mile_emp','square_feet','hi_flow_lanes','womens_sink_count','x1_mile_pop','x7_min_pop','x1_2_mile_pop']].sort_index()
In [165]:
fig, axs = plt.subplots(5, 1, figsize=(15, 15), sharex=True)

fold = 0
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train['daily_yoy_ndt.total_inside_sales'].plot(ax=axs[fold],
                          label='Training Set',
                          title=f'Data Train/Test Split Fold {fold}')
    test['daily_yoy_ndt.total_inside_sales'].plot(ax=axs[fold],
                         label='Test Set')
    axs[fold].axvline(test.index.min(), color='black', ls='--')
    fold += 1
plt.show()

Creating time series features based on time series index

In [166]:
def create_features(df):
  
    df = df.copy()
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    return df

df = create_features(df)

Defining Lags

In [167]:
def add_lags(df):
  target_map= df['daily_yoy_ndt.total_inside_sales'].to_dict()
  df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
  return df
In [168]:
df = add_lags(df)
In [169]:
df.tail()
Out[169]:
daily_yoy_ndt.total_inside_sales x5_min_pop x1_2_mile_emp square_feet hi_flow_lanes womens_sink_count x1_mile_pop x7_min_pop x1_2_mile_pop dayofweek quarter month year dayofyear dayofmonth lag1
calendar.calendar_day_date
2023-08-12 4160.1840 15721.0 656.0 2933.0 1.0 2.0 3629.0 26142.0 1003.0 5 3 8 2023 224 12 3775.6565
2023-08-13 3378.9595 15721.0 656.0 2933.0 1.0 2.0 3629.0 26142.0 1003.0 6 3 8 2023 225 13 2600.2270
2023-08-14 3156.0760 15721.0 656.0 2933.0 1.0 2.0 3629.0 26142.0 1003.0 0 3 8 2023 226 14 3868.9805
2023-08-15 3331.7060 15721.0 656.0 2933.0 1.0 2.0 3629.0 26142.0 1003.0 1 3 8 2023 227 15 3613.2180
2023-08-16 3389.2145 15721.0 656.0 2933.0 1.0 2.0 3629.0 26142.0 1003.0 2 3 8 2023 228 16 4138.0465

Fitting the Model to calculate the Evaluation Metrics

In [170]:
tss = TimeSeriesSplit(n_splits=5)
df = df.sort_index()


fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]


    TARGET = 'daily_yoy_ndt.total_inside_sales'

    X_train = train
    y_train = train[TARGET]

    X_test = test
    y_test = test[TARGET]

    reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                           n_estimators=1000,
                           early_stopping_rounds=50,
                           objective='reg:linear',
                           max_depth=3,
                           learning_rate=0.01)
    reg.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=100)

    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)
[21:57:50] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:3212.66713	validation_1-rmse:2766.08962
[100]	validation_0-rmse:1187.68306	validation_1-rmse:1017.85356
[200]	validation_0-rmse:440.50747	validation_1-rmse:375.21250
[300]	validation_0-rmse:164.36818	validation_1-rmse:139.09861
[400]	validation_0-rmse:61.98767	validation_1-rmse:52.37402
[500]	validation_0-rmse:23.92140	validation_1-rmse:20.79063
[600]	validation_0-rmse:9.66659	validation_1-rmse:10.12803
[700]	validation_0-rmse:4.31186	validation_1-rmse:7.17832
[800]	validation_0-rmse:2.18103	validation_1-rmse:6.43335
[900]	validation_0-rmse:1.26939	validation_1-rmse:6.20607
[999]	validation_0-rmse:0.86621	validation_1-rmse:6.10832
[21:57:54] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2997.70929	validation_1-rmse:2714.77778
[100]	validation_0-rmse:1105.70732	validation_1-rmse:999.16138
[200]	validation_0-rmse:409.11247	validation_1-rmse:373.23616
[300]	validation_0-rmse:152.30692	validation_1-rmse:147.60784
[400]	validation_0-rmse:57.38326	validation_1-rmse:67.41317
[500]	validation_0-rmse:22.21636	validation_1-rmse:37.86811
[600]	validation_0-rmse:9.17142	validation_1-rmse:25.11995
[700]	validation_0-rmse:4.30010	validation_1-rmse:19.00206
[800]	validation_0-rmse:2.36521	validation_1-rmse:15.68220
[900]	validation_0-rmse:1.48334	validation_1-rmse:13.76809
[999]	validation_0-rmse:1.06785	validation_1-rmse:12.64953
[21:57:58] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2906.45205	validation_1-rmse:3128.17480
[100]	validation_0-rmse:1071.82029	validation_1-rmse:1151.00751
[200]	validation_0-rmse:396.66625	validation_1-rmse:423.76890
[300]	validation_0-rmse:147.40999	validation_1-rmse:156.27974
[400]	validation_0-rmse:55.13333	validation_1-rmse:57.89301
[500]	validation_0-rmse:20.90241	validation_1-rmse:21.58088
[600]	validation_0-rmse:8.20601	validation_1-rmse:8.30579
[700]	validation_0-rmse:3.51627	validation_1-rmse:3.72868
[800]	validation_0-rmse:1.81106	validation_1-rmse:2.48206
[900]	validation_0-rmse:1.16227	validation_1-rmse:2.22583
[999]	validation_0-rmse:0.92522	validation_1-rmse:2.18174
[21:58:03] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2963.41316	validation_1-rmse:3115.70044
[100]	validation_0-rmse:1091.77617	validation_1-rmse:1144.72511
[200]	validation_0-rmse:403.60357	validation_1-rmse:421.24003
[300]	validation_0-rmse:149.83008	validation_1-rmse:155.07669
[400]	validation_0-rmse:55.95566	validation_1-rmse:57.33828
[500]	validation_0-rmse:21.15694	validation_1-rmse:21.29336
[600]	validation_0-rmse:8.25735	validation_1-rmse:8.03381
[700]	validation_0-rmse:3.49420	validation_1-rmse:3.32706
[800]	validation_0-rmse:1.80140	validation_1-rmse:1.94748
[900]	validation_0-rmse:1.18361	validation_1-rmse:1.65098
[999]	validation_0-rmse:0.96905	validation_1-rmse:1.60629
[21:58:08] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2994.49702	validation_1-rmse:2909.36275
[100]	validation_0-rmse:1102.85256	validation_1-rmse:1066.14230
[200]	validation_0-rmse:407.47457	validation_1-rmse:391.42662
[300]	validation_0-rmse:151.14409	validation_1-rmse:144.16329
[400]	validation_0-rmse:56.37829	validation_1-rmse:53.31715
[500]	validation_0-rmse:21.26931	validation_1-rmse:19.86655
[600]	validation_0-rmse:8.26391	validation_1-rmse:7.65255
[700]	validation_0-rmse:3.46166	validation_1-rmse:3.46083
[800]	validation_0-rmse:1.77951	validation_1-rmse:2.35544
[900]	validation_0-rmse:1.18362	validation_1-rmse:2.12154
[999]	validation_0-rmse:0.98345	validation_1-rmse:2.07254

Calculation of evaluation metrics

In [171]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2) Score: {r2:.2f}')
Mean Absolute Error (MAE): 0.90
Mean Squared Error (MSE): 4.30
R-squared (R2) Score: 1.00
In [172]:
print(f'Score across folds {np.mean(scores):0.4f}')
print(f'Fold scores:{scores}')
Score across folds 4.9237
Fold scores:[6.108319557071602, 12.64952762771081, 2.1817435813609154, 1.606288275418703, 2.0725409138716193]

Retraining all data

In [173]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
df['isFuture'] = False

TARGET = 'daily_yoy_ndt.total_inside_sales'

# Define features (excluding the target and 'isFuture' column)
FEATURES = ['x5_min_pop', 'x1_2_mile_emp', 'square_feet', 'hi_flow_lanes',
            'womens_sink_count', 'x1_mile_pop', 'x7_min_pop', 'x1_2_mile_pop',
            'dayofweek', 'quarter', 'month', 'year', 'dayofyear',
            'dayofmonth', 'lag1', 'isFuture']

# Split the data into features (X) and target (y)
X = df[FEATURES]
y = df[TARGET]

# Create the XGBoost model
reg = xgb.XGBRegressor(base_score=0.5,
                       booster='gbtree',
                       n_estimators=500,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)

# Fit the model
reg.fit(X, y, eval_set=[(X, y)], verbose=100)
[21:58:15] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2982.21578
[100]	validation_0-rmse:1283.44364
[200]	validation_0-rmse:772.52244
[300]	validation_0-rmse:629.97311
[400]	validation_0-rmse:573.22153
[499]	validation_0-rmse:535.96527
Out[173]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, objective='reg:linear', predictor='auto',
             random_state=0, reg_alpha=0, ...)

Creating Future Dataframe

In [174]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
In [175]:
future_w_features = df_and_future.query('isFuture').copy()
In [176]:
future_w_features.head()
Out[176]:
daily_yoy_ndt.total_inside_sales x5_min_pop x1_2_mile_emp square_feet hi_flow_lanes womens_sink_count x1_mile_pop x7_min_pop x1_2_mile_pop dayofweek quarter month year dayofyear dayofmonth lag1 isFuture
2023-08-17 NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 3 8 2023 229 17 3871.1925 True
2023-08-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 3 8 2023 230 18 3635.4395 True
2023-08-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 3 8 2023 231 19 2902.2000 True
2023-08-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 3 8 2023 232 20 2724.3825 True
2023-08-21 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 3 8 2023 233 21 4579.2355 True
In [177]:
future_w_features = future_w_features.drop('daily_yoy_ndt.total_inside_sales', axis=1)
In [178]:
future_w_features['pred'] = reg.predict(future_w_features)
In [179]:
future_w_features.index = pd.to_datetime(future_w_features.index)  # Make sure index is datetime
plt.figure(figsize=(10, 5))
plt.plot(future_w_features.index, future_w_features['pred'], color='r', marker='o', linestyle='-', linewidth=1, markersize=3)
plt.title('Future Predictions For Inside Sales ')
plt.xlabel('Date')
plt.ylabel('Inside Sales')
plt.grid(True)
plt.show()

Combining Future Prediction with the Historical Data

In [180]:
plt.figure(figsize=(15, 5))

# Plot the entire historical data
encoded_df['daily_yoy_ndt.total_inside_sales'].plot(label='Historical Data', color='b')
plt.axvline('2023-08-01', color='black', ls='--')
# Plot future predictions
plt.plot(future_w_features.index, future_w_features['pred'], label='Future Predictions', color='red', linestyle='dashed')

# Set title and labels
plt.title('Historical Data and Future Predictions of Total Inside Sales')
plt.xlabel('Date')
plt.ylabel('Value')

# Add legend
plt.legend()

# Show plot
plt.show()

Displaying the data for future predictions

In [181]:
start_date = '2023-08-17'
end_date = pd.to_datetime(start_date) + pd.DateOffset(days=502)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')


# Create a DataFrame for the forecasted sales
forecasted_sales_df = pd.DataFrame({'Date': date_range, 'Sales': future_w_features['pred']})

# Display the DataFrame
print(forecasted_sales_df)
                 Date        Sales
2023-08-17 2023-08-17  3135.995117
2023-08-18 2023-08-18  3226.890869
2023-08-19 2023-08-19  3033.454590
2023-08-20 2023-08-20  2554.895020
2023-08-21 2023-08-21  2930.319336
...               ...          ...
2024-12-27 2024-12-27  2617.350830
2024-12-28 2024-12-28  2447.573486
2024-12-29 2024-12-29  2030.570679
2024-12-30 2024-12-30  2378.852539
2024-12-31 2024-12-31  2378.852539

[503 rows x 2 columns]

C.Target Variable = Diesel

Plotting the graph for sale of diesel over the years

In [182]:
average_sales = encoded_df['diesel'].mean()
encoded_df['diesel'].plot(style=".", figsize=(15,5), title="Diesel Sale over the Years")
plt.axhline(y=average_sales, color='r', linestyle='-', label='Average Sales')
plt.legend()
plt.show()

Train Test Split

In [183]:
#splitting test as date less than 1st jan 2023 and train as later

train= encoded_df['diesel'].loc[encoded_df.index< '2023-01-01']
test= encoded_df['diesel'].loc[encoded_df.index>= '2023-01-01']
fig, ax= plt.subplots(figsize=(15,5))
train.plot(ax=ax, label='training set')
test.plot(ax=ax, label='Test set')
ax.axvline('2023-01-01',color='black',ls='--')
ax.legend(['Training Set','Test Set'])
plt.show()

Splitting data into training and testing sets

In [184]:
# Features (X) and Target (y)
X = encoded_df.drop(columns=['diesel'])
y = encoded_df['diesel']

# Split data into training and testing sets
train_end_date = '2023-01-01'
X_train = X[X.index < train_end_date]
X_test = X[X.index >= train_end_date]
y_train = y[y.index < train_end_date]
y_test = y[y.index >= train_end_date]

We now have X_train, X_test, y_train, and y_test ready for modeling

In [185]:
reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                       n_estimators=1000,
                       early_stopping_rounds=50,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)
[21:58:19] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2795.35556	validation_1-rmse:2233.65354
[100]	validation_0-rmse:1258.20495	validation_1-rmse:1407.52785
[200]	validation_0-rmse:756.19770	validation_1-rmse:1105.65761
[300]	validation_0-rmse:598.51426	validation_1-rmse:884.90520
[400]	validation_0-rmse:545.53442	validation_1-rmse:807.14130
[500]	validation_0-rmse:521.01554	validation_1-rmse:785.80954
[600]	validation_0-rmse:504.10290	validation_1-rmse:775.42567
[657]	validation_0-rmse:496.58099	validation_1-rmse:781.93512
Out[185]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=50, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, objective='reg:linear',
             predictor='auto', random_state=0, reg_alpha=0, ...)

Identifying Top 10 features impacting the Diesel Sale

In [186]:
fi = pd.DataFrame(data=reg.feature_importances_,
             index=reg.feature_names_in_,
             columns=['importance'])
top_10_features = fi.sort_values(by='importance', ascending=False).head(10)
top_10_features.plot(kind='barh', title='Top 10 Feature Importance')
Out[186]:
<matplotlib.axes._subplots.AxesSubplot at 0x2319182feb8>

Plotting the graph to depict Train and Test Split

In [187]:
train = encoded_df[['diesel','hi_flow_lanes_fueling_positions','x1_mile_emp',
'womens_toilet_count','daily_yoy_ndt.total_food_service','rv_lanes_fueling_positions','site_id_msba','traditional_forecourt_fueling_positions','x1_mile_pop']].loc[encoded_df.index < '2023-01-01']
test = encoded_df[['diesel','hi_flow_lanes_fueling_positions','x1_mile_emp',
'womens_toilet_count','daily_yoy_ndt.total_food_service','rv_lanes_fueling_positions','site_id_msba','traditional_forecourt_fueling_positions','x1_mile_pop']].loc[encoded_df.index >= '2023-01-01']

fig, ax = plt.subplots(figsize=(15, 5))
train['diesel'].plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test['diesel'].plot(ax=ax, label='Test Set')
ax.axvline('2023-01-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Time series cross validation

In [188]:
tss = TimeSeriesSplit(n_splits=5)
df = encoded_df[['diesel','hi_flow_lanes_fueling_positions','x1_mile_emp',
'womens_toilet_count','daily_yoy_ndt.total_food_service','rv_lanes_fueling_positions','site_id_msba','traditional_forecourt_fueling_positions','x1_mile_pop']].sort_index()
In [189]:
fig, axs = plt.subplots(5, 1, figsize=(15, 15), sharex=True)

fold = 0
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train['diesel'].plot(ax=axs[fold],
                          label='Training Set',
                          title=f'Data Train/Test Split Fold {fold}')
    test['diesel'].plot(ax=axs[fold],
                         label='Test Set')
    axs[fold].axvline(test.index.min(), color='black', ls='--')
    fold += 1
plt.show()

Creating time series features based on time series index

In [190]:
def create_features(df):
  
    df = df.copy()
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    return df

df = create_features(df)

Defining Lags

In [191]:
def add_lags(df):
  target_map= df['diesel'].to_dict()
  df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
  return df
In [192]:
df = add_lags(df)
In [193]:
df.tail()
Out[193]:
diesel hi_flow_lanes_fueling_positions x1_mile_emp womens_toilet_count daily_yoy_ndt.total_food_service rv_lanes_fueling_positions site_id_msba traditional_forecourt_fueling_positions x1_mile_pop dayofweek quarter month year dayofyear dayofmonth lag1
calendar.calendar_day_date
2023-08-12 775.0610 0.0 1074.0 2.0 843.5980 0.0 24150 16.0 3629.0 5 3 8 2023 224 12 592.6515
2023-08-13 1054.0110 0.0 1074.0 2.0 719.3655 0.0 24150 16.0 3629.0 6 3 8 2023 225 13 554.3125
2023-08-14 970.6655 0.0 1074.0 2.0 772.2015 0.0 24150 16.0 3629.0 0 3 8 2023 226 14 3315.3785
2023-08-15 1066.5375 0.0 1074.0 2.0 843.6260 0.0 24150 16.0 3629.0 1 3 8 2023 227 15 3843.2905
2023-08-16 811.8005 0.0 1074.0 2.0 849.2155 0.0 24150 16.0 3629.0 2 3 8 2023 228 16 2447.0145

Fitting the Model to calculate Evaluation Metrics

In [194]:
tss = TimeSeriesSplit(n_splits=5)
df = df.sort_index()


fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]


    TARGET = 'diesel'

    X_train = train
    y_train = train[TARGET]

    X_test = test
    y_test = test[TARGET]

    reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                           n_estimators=1000,
                           early_stopping_rounds=50,
                           objective='reg:linear',
                           max_depth=3,
                           learning_rate=0.01)
    reg.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=100)

    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)
[21:58:38] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2056.98095	validation_1-rmse:1931.15989
[100]	validation_0-rmse:769.43963	validation_1-rmse:721.10420
[200]	validation_0-rmse:289.50125	validation_1-rmse:271.06784
[300]	validation_0-rmse:110.31980	validation_1-rmse:104.01348
[400]	validation_0-rmse:43.02471	validation_1-rmse:41.43529
[500]	validation_0-rmse:17.50451	validation_1-rmse:18.22301
[600]	validation_0-rmse:7.70881	validation_1-rmse:9.85620
[700]	validation_0-rmse:3.94146	validation_1-rmse:7.09750
[800]	validation_0-rmse:2.24694	validation_1-rmse:6.02644
[900]	validation_0-rmse:1.38748	validation_1-rmse:5.58194
[999]	validation_0-rmse:0.94504	validation_1-rmse:5.39123
[21:58:41] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:1995.00842	validation_1-rmse:2638.82200
[100]	validation_0-rmse:743.31887	validation_1-rmse:1329.62411
[200]	validation_0-rmse:278.49585	validation_1-rmse:889.61394
[300]	validation_0-rmse:105.49155	validation_1-rmse:732.68046
[400]	validation_0-rmse:40.66849	validation_1-rmse:665.29711
[500]	validation_0-rmse:16.15843	validation_1-rmse:632.19551
[600]	validation_0-rmse:6.74729	validation_1-rmse:614.07087
[700]	validation_0-rmse:3.05880	validation_1-rmse:605.03875
[800]	validation_0-rmse:1.65862	validation_1-rmse:600.30747
[900]	validation_0-rmse:1.08514	validation_1-rmse:597.31888
[999]	validation_0-rmse:0.85504	validation_1-rmse:595.51463
[21:58:44] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2228.65509	validation_1-rmse:3460.93895
[100]	validation_0-rmse:839.64227	validation_1-rmse:1621.92340
[200]	validation_0-rmse:320.68899	validation_1-rmse:845.32231
[300]	validation_0-rmse:125.28983	validation_1-rmse:505.16779
[400]	validation_0-rmse:51.17164	validation_1-rmse:354.00350
[500]	validation_0-rmse:22.73387	validation_1-rmse:283.28069
[600]	validation_0-rmse:11.58823	validation_1-rmse:244.04315
[700]	validation_0-rmse:6.47983	validation_1-rmse:221.45985
[800]	validation_0-rmse:3.82968	validation_1-rmse:208.23313
[900]	validation_0-rmse:2.41155	validation_1-rmse:200.41498
[999]	validation_0-rmse:1.67199	validation_1-rmse:195.79431
[21:58:49] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2590.65177	validation_1-rmse:3487.16250
[100]	validation_0-rmse:977.81282	validation_1-rmse:1300.27251
[200]	validation_0-rmse:372.21835	validation_1-rmse:487.94890
[300]	validation_0-rmse:143.89863	validation_1-rmse:184.05172
[400]	validation_0-rmse:57.70486	validation_1-rmse:71.56786
[500]	validation_0-rmse:24.91438	validation_1-rmse:30.62896
[600]	validation_0-rmse:12.01284	validation_1-rmse:17.64480
[700]	validation_0-rmse:6.64499	validation_1-rmse:14.51394
[800]	validation_0-rmse:3.95145	validation_1-rmse:13.76373
[900]	validation_0-rmse:2.56763	validation_1-rmse:13.55845
[999]	validation_0-rmse:1.89204	validation_1-rmse:13.49568
[21:58:54] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2793.02493	validation_1-rmse:2349.68111
[100]	validation_0-rmse:1050.98284	validation_1-rmse:892.69791
[200]	validation_0-rmse:398.36512	validation_1-rmse:334.56436
[300]	validation_0-rmse:153.15500	validation_1-rmse:127.79786
[400]	validation_0-rmse:60.72607	validation_1-rmse:49.90619
[500]	validation_0-rmse:25.68449	validation_1-rmse:19.92910
[600]	validation_0-rmse:12.01727	validation_1-rmse:8.93485
[700]	validation_0-rmse:6.54239	validation_1-rmse:5.64866
[800]	validation_0-rmse:3.92265	validation_1-rmse:4.49343
[900]	validation_0-rmse:2.61467	validation_1-rmse:4.16645
[999]	validation_0-rmse:1.99770	validation_1-rmse:4.10915

Calculation of evaluation metrics

In [195]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2) Score: {r2:.2f}')
Mean Absolute Error (MAE): 1.66
Mean Squared Error (MSE): 16.89
R-squared (R2) Score: 1.00
In [196]:
print(f'Score across folds {np.mean(scores):0.4f}')
print(f'Fold scores:{scores}')
Score across folds 162.8610
Fold scores:[5.391235451494224, 595.5146331882756, 195.79431784659573, 13.495669455179753, 4.109145668884052]

Retraining all data

In [197]:
future = pd.date_range('2023-08-01','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
df['isFuture'] = False

# Define your target variable
TARGET = 'diesel'

# Define features (excluding the target and 'isFuture' column)
FEATURES = ['hi_flow_lanes_fueling_positions','x1_mile_emp',
'womens_toilet_count','daily_yoy_ndt.total_food_service','rv_lanes_fueling_positions','site_id_msba','traditional_forecourt_fueling_positions','x1_mile_pop',
            'dayofweek', 'quarter', 'month', 'year', 'dayofyear',
            'dayofmonth','lag1', 'isFuture']

# Split the data into features (X) and target (y)
X = df[FEATURES]
y = df[TARGET]

# Create the XGBoost model
reg = xgb.XGBRegressor(base_score=0.5,
                       booster='gbtree',
                       n_estimators=500,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)

# Fit the model
reg.fit(X, y, eval_set=[(X, y)], verbose=100)
[21:59:01] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2726.67697
[100]	validation_0-rmse:1279.14363
[200]	validation_0-rmse:786.40165
[300]	validation_0-rmse:639.75821
[400]	validation_0-rmse:586.72424
[499]	validation_0-rmse:559.82025
Out[197]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, objective='reg:linear', predictor='auto',
             random_state=0, reg_alpha=0, ...)

Creating Future Dataframe

In [198]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
In [199]:
future_w_features = df_and_future.query('isFuture').copy()
In [200]:
future_w_features.head()
Out[200]:
diesel hi_flow_lanes_fueling_positions x1_mile_emp womens_toilet_count daily_yoy_ndt.total_food_service rv_lanes_fueling_positions site_id_msba traditional_forecourt_fueling_positions x1_mile_pop dayofweek quarter month year dayofyear dayofmonth lag1 isFuture
2023-08-17 NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 3 8 2023 229 17 1967.9555 True
2023-08-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 3 8 2023 230 18 537.1450 True
2023-08-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 3 8 2023 231 19 117.9885 True
2023-08-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 3 8 2023 232 20 331.0055 True
2023-08-21 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 3 8 2023 233 21 3809.0605 True
In [201]:
future_w_features = future_w_features.drop('diesel', axis=1)
In [202]:
future_w_features['pred'] = reg.predict(future_w_features)
In [203]:
future_w_features.index = pd.to_datetime(future_w_features.index)  # Make sure index is datetime
plt.figure(figsize=(10, 5))
plt.plot(future_w_features.index, future_w_features['pred'], color='r', marker='o', linestyle='-', linewidth=1, markersize=3)
plt.title('Future Predictions For Diesel ')
plt.xlabel('Date')
plt.ylabel('Diesel')
plt.grid(True)
plt.show()

Combining Future Prediction with the Historical Data

In [204]:
plt.figure(figsize=(15, 5))


encoded_df['diesel'].plot(label='Historical Data', color='b')
plt.axvline('2023-08-01', color='black', ls='--')
# Plot future predictions
plt.plot(future_w_features.index, future_w_features['pred'], label='Future Predictions', color='red', linestyle='dashed')

# Set title and labels
plt.title('Historical Data and Future Predictions of Diesel Sales')
plt.xlabel('Date')
plt.ylabel('Value')

plt.legend()

plt.show()

Displaying the data for future predictions

In [205]:
start_date = '2023-08-17'
end_date = pd.to_datetime(start_date) + pd.DateOffset(days=502)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')


# Create a DataFrame for the forecasted sales
forecasted_sales_df = pd.DataFrame({'Date': date_range, 'Sales': future_w_features['pred']})

# Display the DataFrame
print(forecasted_sales_df)
                 Date       Sales
2023-08-17 2023-08-17  385.412079
2023-08-18 2023-08-18  288.672363
2023-08-19 2023-08-19  275.999298
2023-08-20 2023-08-20  275.716888
2023-08-21 2023-08-21  386.814056
...               ...         ...
2024-12-27 2024-12-27  263.729614
2024-12-28 2024-12-28  251.056610
2024-12-29 2024-12-29  250.774216
2024-12-30 2024-12-30  337.719269
2024-12-31 2024-12-31  337.719269

[503 rows x 2 columns]

D.Target Variable = Unleaded

Plotting the graph for sale of Unleaded over the years

In [206]:
average_sales = encoded_df['unleaded'].mean()
encoded_df['unleaded'].plot(style=".", figsize=(15,5), title="Unleaded Sale over the Years")
plt.axhline(y=average_sales, color='r', linestyle='-', label='Average Sales')
plt.legend()
plt.show()

Train Test Split

In [207]:
train= encoded_df['unleaded'].loc[encoded_df.index< '2023-01-01']
test= encoded_df['unleaded'].loc[encoded_df.index>= '2023-01-01']
fig, ax= plt.subplots(figsize=(15,5))
train.plot(ax=ax, label='training set')
test.plot(ax=ax, label='Test set')
ax.axvline('2023-01-01',color='black',ls='--')
ax.legend(['Training Set','Test Set'])
plt.show()

Splitting data into training and testing sets

In [208]:
X = encoded_df.drop(columns=['unleaded'])
y = encoded_df['unleaded']

# Split data into training and testing sets
train_end_date = '2023-01-01'
X_train = X[X.index < train_end_date]
X_test = X[X.index >= train_end_date]
y_train = y[y.index < train_end_date]
y_test = y[y.index >= train_end_date]

We now have X_train, X_test, y_train, and y_test ready for modeling

In [209]:
reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                       n_estimators=1000,
                       early_stopping_rounds=50,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)
[21:59:07] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2608.48035	validation_1-rmse:2299.91248
[100]	validation_0-rmse:1140.09762	validation_1-rmse:1014.64180
[200]	validation_0-rmse:654.31406	validation_1-rmse:617.97718
[300]	validation_0-rmse:509.91423	validation_1-rmse:525.07275
[400]	validation_0-rmse:451.49059	validation_1-rmse:490.80125
[500]	validation_0-rmse:417.55222	validation_1-rmse:460.59825
[600]	validation_0-rmse:397.49523	validation_1-rmse:445.59669
[700]	validation_0-rmse:384.14174	validation_1-rmse:435.94254
[800]	validation_0-rmse:372.75246	validation_1-rmse:427.81512
[900]	validation_0-rmse:363.16875	validation_1-rmse:422.25175
[999]	validation_0-rmse:355.91623	validation_1-rmse:417.11593
Out[209]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=50, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=1000,
             n_jobs=0, num_parallel_tree=1, objective='reg:linear',
             predictor='auto', random_state=0, reg_alpha=0, ...)

Identifying Top 10 features impacting the Food Sale

In [210]:
fi = pd.DataFrame(data=reg.feature_importances_,
             index=reg.feature_names_in_,
             columns=['importance'])
top_10_features = fi.sort_values(by='importance', ascending=False).head(10)
top_10_features.plot(kind='barh', title='Top 10 Feature Importance')
Out[210]:
<matplotlib.axes._subplots.AxesSubplot at 0x23190d590f0>

Plotting the graph to depict Train and Test Split

In [211]:
train = encoded_df[['unleaded','mens_toilet_count','x1_mile_income', 'traditional_forecourt_stack_type','square_feet','lottery','years_since_last_project','x7_min_emp','total_sales']].loc[encoded_df.index < '2023-01-01']
test = encoded_df[['unleaded','mens_toilet_count','x1_mile_income', 'traditional_forecourt_stack_type','square_feet','lottery','years_since_last_project','x7_min_emp','total_sales']].loc[encoded_df.index >= '2023-01-01']

fig, ax = plt.subplots(figsize=(15, 5))
train['unleaded'].plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test['unleaded'].plot(ax=ax, label='Test Set')
ax.axvline('2023-01-01', color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()

Time series cross validation

In [212]:
tss = TimeSeriesSplit(n_splits=5)
df = encoded_df[['unleaded','mens_toilet_count','x1_mile_income', 'traditional_forecourt_stack_type','square_feet','lottery','years_since_last_project','x7_min_emp','total_sales']].sort_index()
In [213]:
fig, axs = plt.subplots(5, 1, figsize=(15, 15), sharex=True)

fold = 0
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]
    train['unleaded'].plot(ax=axs[fold],
                          label='Training Set',
                          title=f'Data Train/Test Split Fold {fold}')
    test['unleaded'].plot(ax=axs[fold],
                         label='Test Set')
    axs[fold].axvline(test.index.min(), color='black', ls='--')
    fold += 1
plt.show()

Creating time series features based on time series index

In [214]:
def create_features(df):
  
    df = df.copy()
    df['dayofweek'] = df.index.dayofweek
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    df['dayofmonth'] = df.index.day
    return df

df = create_features(df)

Defining Lags

In [215]:
def add_lags(df):
  target_map= df['unleaded'].to_dict()
  df['lag1'] = (df.index - pd.Timedelta('364 days')).map(target_map)
  return df
In [216]:
df = add_lags(df)
In [217]:
df.tail()
Out[217]:
unleaded mens_toilet_count x1_mile_income traditional_forecourt_stack_type square_feet lottery years_since_last_project x7_min_emp total_sales dayofweek quarter month year dayofyear dayofmonth lag1
calendar.calendar_day_date
2023-08-12 3625.6535 2.0 62410.0 0.0 2933.0 0.0 1.0 9630.0 5003.7820 5 3 8 2023 224 12 5688.7985
2023-08-13 3385.9315 2.0 62410.0 0.0 2933.0 0.0 1.0 9630.0 4098.3250 6 3 8 2023 225 13 3120.3095
2023-08-14 3203.5885 2.0 62410.0 0.0 2933.0 0.0 1.0 9630.0 3928.2775 0 3 8 2023 226 14 4044.7295
2023-08-15 2911.9055 2.0 62410.0 0.0 2933.0 0.0 1.0 9630.0 4175.3320 1 3 8 2023 227 15 3700.1405
2023-08-16 3155.2360 2.0 62410.0 0.0 2933.0 0.0 1.0 9630.0 4238.4300 2 3 8 2023 228 16 2027.1370

Fitting the Model to calculate Evaluation Metrics

In [218]:
tss = TimeSeriesSplit(n_splits=5)
df = df.sort_index()


fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[val_idx]


    TARGET = 'unleaded'

    X_train = train
    y_train = train[TARGET]

    X_test = test
    y_test = test[TARGET]

    reg = xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                           n_estimators=1000,
                           early_stopping_rounds=50,
                           objective='reg:linear',
                           max_depth=3,
                           learning_rate=0.01)
    reg.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=100)

    y_pred = reg.predict(X_test)
    preds.append(y_pred)
    score = np.sqrt(mean_squared_error(y_test, y_pred))
    scores.append(score)
[21:59:30] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2285.94773	validation_1-rmse:2552.91836
[100]	validation_0-rmse:843.49688	validation_1-rmse:1016.62201
[200]	validation_0-rmse:312.21191	validation_1-rmse:448.63285
[300]	validation_0-rmse:116.17374	validation_1-rmse:247.09893
[400]	validation_0-rmse:43.65743	validation_1-rmse:177.38432
[500]	validation_0-rmse:16.79893	validation_1-rmse:150.41631
[600]	validation_0-rmse:6.87352	validation_1-rmse:137.11285
[700]	validation_0-rmse:3.17343	validation_1-rmse:129.30540
[800]	validation_0-rmse:1.66533	validation_1-rmse:124.74254
[900]	validation_0-rmse:0.98335	validation_1-rmse:122.03339
[999]	validation_0-rmse:0.65166	validation_1-rmse:120.42414
[21:59:32] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2422.45641	validation_1-rmse:2717.91039
[100]	validation_0-rmse:894.09886	validation_1-rmse:1061.15189
[200]	validation_0-rmse:331.08841	validation_1-rmse:449.98285
[300]	validation_0-rmse:123.36947	validation_1-rmse:229.87762
[400]	validation_0-rmse:46.46935	validation_1-rmse:150.38831
[500]	validation_0-rmse:17.83443	validation_1-rmse:118.77428
[600]	validation_0-rmse:7.08262	validation_1-rmse:105.20172
[700]	validation_0-rmse:3.08491	validation_1-rmse:98.95556
[800]	validation_0-rmse:1.53296	validation_1-rmse:95.88095
[900]	validation_0-rmse:0.95475	validation_1-rmse:94.04690
[999]	validation_0-rmse:0.73450	validation_1-rmse:92.93056
[21:59:36] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2524.44580	validation_1-rmse:2865.54506
[100]	validation_0-rmse:932.61880	validation_1-rmse:1054.88697
[200]	validation_0-rmse:345.93331	validation_1-rmse:390.28721
[300]	validation_0-rmse:129.29143	validation_1-rmse:144.90220
[400]	validation_0-rmse:48.99342	validation_1-rmse:54.35093
[500]	validation_0-rmse:19.02506	validation_1-rmse:20.80759
[600]	validation_0-rmse:7.80421	validation_1-rmse:8.46931
[700]	validation_0-rmse:3.68091	validation_1-rmse:4.38235
[800]	validation_0-rmse:2.01202	validation_1-rmse:3.14294
[900]	validation_0-rmse:1.28905	validation_1-rmse:2.79365
[999]	validation_0-rmse:0.97914	validation_1-rmse:2.68787
[21:59:41] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2613.90357	validation_1-rmse:2641.03170
[100]	validation_0-rmse:964.88820	validation_1-rmse:969.89395
[200]	validation_0-rmse:357.50031	validation_1-rmse:356.23858
[300]	validation_0-rmse:133.37560	validation_1-rmse:131.05780
[400]	validation_0-rmse:50.38617	validation_1-rmse:48.58559
[500]	validation_0-rmse:19.48982	validation_1-rmse:18.07865
[600]	validation_0-rmse:7.91913	validation_1-rmse:6.85358
[700]	validation_0-rmse:3.65548	validation_1-rmse:2.92433
[800]	validation_0-rmse:2.00652	validation_1-rmse:1.71159
[900]	validation_0-rmse:1.31974	validation_1-rmse:1.40923
[999]	validation_0-rmse:1.04009	validation_1-rmse:1.34782
[21:59:48] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2619.34366	validation_1-rmse:2288.72771
[100]	validation_0-rmse:966.43892	validation_1-rmse:838.76535
[200]	validation_0-rmse:357.89002	validation_1-rmse:308.40437
[300]	validation_0-rmse:133.43546	validation_1-rmse:114.22476
[400]	validation_0-rmse:50.33263	validation_1-rmse:42.95588
[500]	validation_0-rmse:19.42094	validation_1-rmse:16.77693
[600]	validation_0-rmse:7.85111	validation_1-rmse:7.12117
[700]	validation_0-rmse:3.58243	validation_1-rmse:3.83230
[800]	validation_0-rmse:1.97206	validation_1-rmse:2.63411
[900]	validation_0-rmse:1.32063	validation_1-rmse:2.09614
[999]	validation_0-rmse:1.06305	validation_1-rmse:1.83113

Calculation of evaluation metrics

In [219]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R2) Score: {r2:.2f}')
Mean Absolute Error (MAE): 0.87
Mean Squared Error (MSE): 3.35
R-squared (R2) Score: 1.00
In [220]:
print(f'Score across folds {np.mean(scores):0.4f}')
print(f'Fold scores:{scores}')
Score across folds 43.8443
Fold scores:[120.42413989752355, 92.9305676065971, 2.687869906027182, 1.3478199421358776, 1.8311297010428342]

Retraining all data

In [221]:
future = pd.date_range('2023-08-01','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
df['isFuture'] = False


TARGET = 'unleaded'


FEATURES = ['mens_toilet_count','x1_mile_income', 'traditional_forecourt_stack_type','square_feet','lottery','years_since_last_project','x7_min_emp','total_sales',
            'dayofweek', 'quarter', 'month', 'year', 'dayofyear',
            'dayofmonth', 'lag1', 'isFuture']


X = df[FEATURES]
y = df[TARGET]
reg = xgb.XGBRegressor(base_score=0.5,
                       booster='gbtree',
                       n_estimators=500,
                       objective='reg:linear',
                       max_depth=3,
                       learning_rate=0.01)

reg.fit(X, y, eval_set=[(X, y)], verbose=100)
[21:59:56] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
[0]	validation_0-rmse:2569.51376
[100]	validation_0-rmse:1135.48688
[200]	validation_0-rmse:703.13822
[300]	validation_0-rmse:569.65933
[400]	validation_0-rmse:514.57376
[499]	validation_0-rmse:479.09994
Out[221]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.01, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, objective='reg:linear', predictor='auto',
             random_state=0, reg_alpha=0, ...)

Creating Future Dataframe

In [222]:
future = pd.date_range('2023-08-17','2024-12-31', freq='D')
future_df = pd.DataFrame(index=future)
future_df['isFuture'] = True
df['isFuture'] = False
df_and_future = pd.concat([df, future_df])
df_and_future = create_features(df_and_future)
df_and_future = add_lags(df_and_future)
In [223]:
future_w_features = df_and_future.query('isFuture').copy()
In [224]:
future_w_features.head()
Out[224]:
unleaded mens_toilet_count x1_mile_income traditional_forecourt_stack_type square_feet lottery years_since_last_project x7_min_emp total_sales dayofweek quarter month year dayofyear dayofmonth lag1 isFuture
2023-08-17 NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 3 8 2023 229 17 3481.7650 True
2023-08-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 3 8 2023 230 18 2473.1070 True
2023-08-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 3 8 2023 231 19 1101.5025 True
2023-08-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 3 8 2023 232 20 2283.5190 True
2023-08-21 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 3 8 2023 233 21 2072.3850 True
In [225]:
future_w_features = future_w_features.drop('unleaded', axis=1)
In [226]:
future_w_features['pred'] = reg.predict(future_w_features)
In [227]:
future_w_features.index = pd.to_datetime(future_w_features.index)  # Make sure index is datetime
plt.figure(figsize=(10, 5))
plt.plot(future_w_features.index, future_w_features['pred'], color='r', marker='o', linestyle='-', linewidth=1, markersize=3)
plt.title('Future Predictions For Unleaded ')
plt.xlabel('Date')
plt.ylabel('Unleaded')
plt.grid(True)
plt.show()

Combining Future Prediction with the Historical Data

In [228]:
plt.figure(figsize=(15, 5))

encoded_df['unleaded'].plot(label='Historical Data', color='b')
plt.axvline('2023-08-17', color='black', ls='--')

plt.plot(future_w_features.index, future_w_features['pred'], label='Future Predictions', color='red', linestyle='dashed')


plt.title('Historical Data and Future Predictions of Unleaded Sales')
plt.xlabel('Date')
plt.ylabel('Value')


plt.legend()

plt.show()

Displaying the data for future predictions

In [229]:
start_date = '2023-08-17'
end_date = pd.to_datetime(start_date) + pd.DateOffset(days=502)
date_range = pd.date_range(start=start_date, end=end_date, freq='D')


forecasted_sales_df = pd.DataFrame({'Date': date_range, 'Sales': future_w_features['pred']})

print(forecasted_sales_df)
                 Date        Sales
2023-08-17 2023-08-17  1906.331177
2023-08-18 2023-08-18  1906.331177
2023-08-19 2023-08-19  1906.331177
2023-08-20 2023-08-20  1906.331177
2023-08-21 2023-08-21  1906.331177
...               ...          ...
2024-12-27 2024-12-27  1906.331177
2024-12-28 2024-12-28  1906.331177
2024-12-29 2024-12-29  1906.331177
2024-12-30 2024-12-30  1906.331177
2024-12-31 2024-12-31  1906.331177

[503 rows x 2 columns]

II. PROPHET

In [230]:
#!pip install pystan==2.19.1.1
In [231]:
#pip install --upgrade numpy
In [232]:
from prophet import Prophet
In [233]:
df_merged['calendar.calendar_day_date'] = t_data.index
df_merged.set_index('calendar.calendar_day_date', inplace=True)
In [235]:
import pandas as pd


categorical_columns = df_merged.select_dtypes(include=['object']).columns

encoded_df = pd.get_dummies(df_merged, columns=categorical_columns)
In [236]:
encoded_df.head()
Out[236]:
calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba calendar_year calendar_month calendar_day open_year ... calendar_information.holiday_Saint Valentine's Day calendar_information.holiday_Thanksgiving Day calendar_information.holiday_Veteran's Day calendar_information.holiday_Washington's Birthday calendar_information.type_of_day_WEEKDAY calendar_information.type_of_day_WEEKEND season_Fall season_Spring season_Summer season_Winter
calendar.calendar_day_date
2022-06-17 25 2168.2920 861.6930 722.7745 1425.1020 24535 2022 6 17 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-22 25 2051.5635 808.0275 730.4850 1436.2740 24535 2022 6 22 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-23 25 2257.5000 966.4410 895.7970 1594.3725 24535 2022 6 23 2022.0 ... 0 0 0 0 1 0 0 0 1 0
2022-06-26 26 1520.5925 542.3250 584.2900 1124.9280 24535 2022 6 26 2022.0 ... 0 0 0 0 0 1 0 0 1 0
2022-06-27 26 1897.6930 771.4525 852.2605 1640.2540 24535 2022 6 27 2022.0 ... 0 0 0 0 1 0 0 0 1 0

5 rows × 127 columns

In [237]:
split_date = '2023-01-01'
inside_train = encoded_df['daily_yoy_ndt.total_inside_sales'].loc[encoded_df.index <= split_date].copy()
inside_test = encoded_df['daily_yoy_ndt.total_inside_sales'].loc[encoded_df.index > split_date].copy()

# Plot train and test so you can see where we have split
(inside_test
    .rename_axis('Date')
    .rename('TEST SET')
    .to_frame()
    .join(inside_train.rename_axis('Date').rename('TRAINING SET').to_frame(),
          how='outer')
    .plot(figsize=(10, 5), title='Inside Sales', style='.', ms=2)
)
plt.show()
In [238]:
inside_train_prophet= inside_train.reset_index().rename(columns={'calendar.calendar_day_date':'ds','daily_yoy_ndt.total_inside_sales':'y'})
In [239]:
inside_train_prophet.head()
Out[239]:
ds y
0 2022-06-17 2168.2920
1 2022-06-22 2051.5635
2 2022-06-23 2257.5000
3 2022-06-26 1520.5925
4 2022-06-27 1897.6930
In [240]:
model= Prophet()
model.fit(inside_train_prophet)
INFO:prophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
Out[240]:
<prophet.forecaster.Prophet at 0x2318fb62860>
In [241]:
inside_test_prophet= inside_test.reset_index().rename(columns={'calendar.calendar_day_date':'ds','daily_yoy_ndt.total_inside_sales':'y'})


inside_sales_tst_forecast= model.predict(inside_test_prophet)
In [242]:
inside_sales_tst_forecast.head()
Out[242]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly weekly_lower weekly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2023-01-02 2557.837513 1331.937640 3694.838319 2557.837513 2557.837513 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 0.0 0.0 0.0 2493.015021
1 2023-01-02 2557.837513 1315.813793 3673.458670 2557.837513 2557.837513 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 0.0 0.0 0.0 2493.015021
2 2023-01-02 2557.837513 1249.708291 3649.210257 2557.837513 2557.837513 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 0.0 0.0 0.0 2493.015021
3 2023-01-02 2557.837513 1324.023702 3639.662733 2557.837513 2557.837513 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 0.0 0.0 0.0 2493.015021
4 2023-01-02 2557.837513 1285.755085 3675.970048 2557.837513 2557.837513 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 -64.822492 0.0 0.0 0.0 2493.015021
In [243]:
fig, ax = plt.subplots(figsize=(10, 5))
fig = model.plot(inside_sales_tst_forecast, ax=ax)
plt.show()
In [244]:
fig = model.plot_components(inside_sales_tst_forecast)
plt.show()
In [245]:
f, ax = plt.subplots(figsize=(15,5))
ax.scatter(inside_test.index, inside_test, color='#FFB6C1')  
fig = model.plot(inside_sales_tst_forecast, ax=ax)
In [246]:
f, ax = plt.subplots(figsize=(15,5))
ax.scatter(inside_test.index, inside_test, color='r')  
fig = model.plot(inside_sales_tst_forecast, ax=ax)
ax.set_xlim(left=pd.to_datetime('2023-01-01'), right=pd.to_datetime('2023-01-31')) 
ax.set_ylim(bottom=0, top=6000) 
plt.suptitle('January 2023 Forecast vs Actual')  
plt.show()
In [247]:
#Jan 2023 first week forecasts in detail
f, ax = plt.subplots(figsize=(15,5))
ax.scatter(inside_test.index, inside_test, color='r')  
fig = model.plot(inside_sales_tst_forecast, ax=ax)
ax.set_xlim(left=pd.to_datetime('2023-01-01'), right=pd.to_datetime('2023-01-07')) 
ax.set_ylim(bottom=0, top=6000) 
plt.suptitle('January 2023 Forecast vs Actual')  
plt.show()
In [248]:
#Evaluate the model with error metrics 
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_true=inside_test,y_pred=inside_sales_tst_forecast['yhat']))
Out[248]:
1137.1203790876662
In [249]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mean_absolute_percentage_error(y_true=inside_test, y_pred=inside_sales_tst_forecast['yhat'])
Out[249]:
30.17504340822068
In [250]:
future= model.make_future_dataframe(periods=730)
forecast=model.predict(future)
In [251]:
fig = model.plot(forecast, figsize=(15, 5))
plt.title('Prophet Forecast')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
In [252]:
forecast[['ds','yhat']].tail(100)
Out[252]:
ds yhat
1350 2024-09-23 -71.768785
1351 2024-09-24 -53.885231
1352 2024-09-25 48.261293
1353 2024-09-26 197.847929
1354 2024-09-27 492.378850
... ... ...
1445 2024-12-27 121.910078
1446 2024-12-28 -485.743784
1447 2024-12-29 -1002.144657
1448 2024-12-30 -470.735155
1449 2024-12-31 -452.851601

100 rows × 2 columns

In our analysis using the Prophet model, we've noticed that the one-year forecast for the Inside Sales target variable exhibits a decreasing trend, which contradicts the actual values represented in light red, indicating a potential overfitting issue. This discrepancy between predicted and actual values can be attributed to certain limitations of the Prophet model that became evident during our dataset testing.

Some of the notable limitations of the Prophet model include its simplicity in handling complex interactions and dependencies within the time series data. The model's approach may not effectively capture intricate relationships among various influencing factors, leading to suboptimal predictions.

Additionally, Prophet makes certain assumptions about seasonality, primarily focusing on weekly and yearly patterns. However, if the actual seasonal patterns in the data follow a different trajectory, the model may struggle to accurately account for these variations.

4. Model Performance

XGBoost Model Performance Food Sales Mean Absolute Error (MAE): 2.51 Mean Squared Error (MSE): 34.41

Inside Sales Mean Absolute Error (MAE): 6.68 Mean Squared Error (MSE): 423.45

Diesel: Mean Absolute Error (MAE): 12.91 Mean Squared Error (MSE): 2676.26

Unleaded: Mean Absolute Error (MAE): 5.83 Mean Squared Error (MSE): 166.44

The XGBoost model's performance across the four different target variables, namely Food Sales, Inside Sales, Diesel, and Unleaded, reveals interesting insights. Firstly, when it comes to the Mean Absolute Error (MAE), we observe that the Food Sales and Unleaded categories have the lowest errors, with MAE values of 2.51 and 5.83, respectively. This indicates that the model's predictions for these two variables are relatively close to the actual values. In contrast, the Inside Sales and Diesel categories have higher MAE values of 6.68 and 12.91, respectively, suggesting more significant prediction errors.

Additionally, the Mean Squared Error (MSE) provides a measure of the model's predictive accuracy, where lower values indicate better performance. In this context, Food Sales and Unleaded again stand out with the lowest MSE values of 34.41 and 166.44, respectively. Conversely, Inside Sales and Diesel have considerably higher MSE values of 423.45 and 2676.26, implying that the model's predictions for these variables deviate more from the actual values.

Given these results, XGBoost appears to be best suited for predicting Food Sales and Unleaded fuel sales. These two variables exhibit the lowest MAE and MSE, indicating the model's better performance and accuracy in predicting them. However, when it comes to Inside Sales and Diesel, the higher MAE and MSE values suggest that the model may not perform as well for these variables, and alternative modeling approaches may be more appropriate. It's essential to consider the specific context and business goals to determine which target variable is of higher importance and, consequently, which one the XGBoost model is best suited for. The XGBoost model delivered strong performance during training, but when applied to the test set, we observed some discrepancies in the predictions for Inside Sales and Diesel. Consequently, we are actively exploring alternative models that can offer improved predictive accuracy for these specific target variables.

5. Results

In our analysis, we've employed both the Prophet and XGBoost models to forecast various target variables, shedding light on their respective strengths and limitations. The Prophet model exhibited a notable drawback in its one-year forecast for the Inside Sales target variable, where it portrayed a decreasing trend that contradicted the actual values, indicating a potential overfitting issue. This discrepancy can be attributed to the model's inherent limitations, particularly its simplistic approach in capturing complex interactions and dependencies within time series data. Additionally, the model's assumptions about seasonality, primarily focusing on weekly and yearly patterns, may lead to inaccuracies when the actual data follows a different seasonal trajectory.

Conversely, the XGBoost model's performance across four different target variables, namely Food Sales, Inside Sales, Diesel, and Unleaded, showcased intriguing results. It revealed that the Food Sales and Unleaded categories had the lowest Mean Absolute Error (MAE) and Mean Squared Error (MSE) values, indicating the model's ability to predict these variables with high accuracy. In contrast, Inside Sales and Diesel displayed higher MAE and MSE values, implying a more significant divergence between predicted and actual values.

While XGBoost demonstrates improved performance compared to Prophet, it's essential to note that the high R2 value suggests potential overfitting in the model. We are currently exploring ways to fine-tune the model to achieve more optimal and reliable results.

6. Contributions

  1. Krishna Chaitanya: XgBoost Model(Model built for Inside Sales), Correlation Matrix, Seasonality, Analysis of unleaded and diesel sales, Results

  2. Anukriti Raj: Arimax Model, XgBoost Model(Model built for Food Sales) Feature Engineering ,Seasonality, Correlation Matrix,Analysis of unleaded and diesel sales, Results

  3. Shivi Shrivastav: Introduction, XgBoost Model(Model built for Unleaded Sales) Table of Contents, Seasonality, Analysis of in-store and food sales sales, Results

  4. Litzy Carbajal: Introduction, XgBoost Model(Model built for Diesel Sales) ,Seasonality, Analysis of in-store and food sales, Results